How do I extract all anchor tags, their href, and their anchor text from a string?

Rya*_*yan 8 php regex preg-replace domdocument preg-match

I need to process the links in an HTML string in several different ways.

$str = 'My long <a href="http://example.com/abc" rel="link">string</a> has any
        <a href="/local/path" title="with attributes">number</a> of
        <a href="#anchor" data-attr="lots">links</a>.';
$links = extractLinks($str);
foreach ($links as $link) {
    $pattern = "#((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)#i";
    if (preg_match($pattern, $link['href'])) {
        // Process Remote links
        //   For example, replace url with short url,
        //   or replace long anchor text with truncated
    } else {
        // Process Local Links, Anchors

    }
}
function extractLinks($str) {
    // First, I tried DomDocument
    $dom = new DomDocument();
    $dom->loadHTML($str);
    return $dom->getElementsByTagName('a');
    // But this just returns:
    //   DOMNodeList Object
    //   (
    //       [length] => 3
    //   )

    // Then I tried Regex
    if(preg_match_all("|<a.*(?=href=\"([^\"]*)\")[^>]*>([^<]*)</a>|i", $str, $matches)) {
        print_r($matches);
    }
    // But this didn't work either.
}

Desired result of extractLinks($str):

[0] => Array(
           'str' => '<a href="http://example.com/abc" rel="link">string</a>',
           'href' => 'http://example.com/abc',
           'anchorText' => 'string'
       ),
[1] => Array(
           'str' => '<a href="/local/path" title="with attributes">number</a>',
           'href' => '/local/path',
           'anchorText' => 'number'
       ),
[2] => Array(
           'str' => '<a href="#anchor" data-attr="lots">links</a>',
           'href' => '#anchor',
           'anchorText' => 'links'
       )

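I need all of these so that I can do things like edit the href (add tracking, shorten it, etc.) or replace the whole tag with something else (<a href="/u/username">username</a> might become username).

Here is a demo of what I am trying to do.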

Jav*_*vad 13

You just need to change it to:

$str = 'My long <a href="http://example.com/abc" rel="link">string</a> has any
    <a href="/local/path" title="with attributes">number</a> of
    <a href="#anchor" data-attr="lots">links</a>.';

$dom = new DomDocument();
$dom->loadHTML($str);
$output = array();
foreach ($dom->getElementsByTagName('a') as $item) {
   $output[] = array (
      'str' => $dom->saveHTML($item),
      'href' => $item->getAttribute('href'),
      'anchorText' => $item->nodeValue
   );
}

By putting it in a loop and using getAttribute, nodeValue, and saveHTML(THE_NODE), you will get your output.
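In case it helps, here is a minimal sketch of how that loop could be wrapped in the extractLinks() function from the question, together with one possible way to tell remote links from local ones. The isRemoteLink() helper and its rule (an href counts as remote when parse_url() finds a host in it) are my own assumptions, not part of the original code, and the libxml error handling is optional.

```php
<?php
// Sketch only: wraps the DOM loop in the extractLinks() function named in the question.
function extractLinks(string $html): array
{
    $dom = new DOMDocument();
    // Collect libxml parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $links[] = [
            'str'        => $dom->saveHTML($a),       // the serialized <a> element
            'href'       => $a->getAttribute('href'), // empty string if the attribute is missing
            'anchorText' => $a->nodeValue,            // text content of the element
        ];
    }
    return $links;
}

// Hypothetical helper (assumption): treat an href as remote when it has a host part.
function isRemoteLink(string $href): bool
{
    $host = parse_url($href, PHP_URL_HOST);
    return is_string($host) && $host !== '';
}
```

With the sample string from the question, this should yield three entries, and only the first href (http://example.com/abc) would be classified as remote under the assumed rule.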

  • @Ryan I updated my answer; it should be `$dom->saveHTML($item)` rather than `$item->saveHTML($item)` (2 upvotes)

Han*_*ler 5

Like this:

<a\s*href="([^"]+)"[^>]+>([^<]+)</a>
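  1. The overall match is the array element 0 you want (the whole tag)
  2. The group 1 capture is the array element 1 you want (the href)
  3. The group 2 capture is the array element 2 you want (the anchor text)

Use preg_match($pattern,$string,$m)

The array elements will be in $m[0], $m[1], and $m[2]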

A working PHP demo:

$string = 'My long <a href="http://example.com/abc" rel="link">string</a> has any
        <a href="/local/path" title="with attributes">number</a> of
        <a href="#anchor" data-attr="lots">links</a>. ';
$regex='|<a\s*href="([^"]+)"[^>]+>([^<]+)</a>|';
$howmany = preg_match_all($regex,$string,$res,PREG_SET_ORDER);
print_r($res);
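To get from the preg_match_all() result to the array shape asked for in the question, something like the following sketch might work. The function name extractLinksWithRegex() is hypothetical. Note that the pattern only matches tags where href is the first attribute, is double-quoted, and is followed by at least one more attribute, and where the anchor text contains no nested tags; that holds for the sample string but not for HTML in general.

```php
<?php
// Sketch only: maps each PREG_SET_ORDER match onto the keys from the question.
// extractLinksWithRegex() is a hypothetical name, not from the original answer.
function extractLinksWithRegex(string $html): array
{
    // Same pattern as in the answer, with its limitations (see the note above).
    $regex = '|<a\s*href="([^"]+)"[^>]+>([^<]+)</a>|';

    $links = [];
    if (preg_match_all($regex, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $links[] = [
                'str'        => $m[0], // overall match: the whole tag
                'href'       => $m[1], // group 1: the href value
                'anchorText' => $m[2], // group 2: the anchor text
            ];
        }
    }
    return $links;
}
```

Calling extractLinksWithRegex($string) with the sample string and passing the result to print_r() should show the three links keyed by 'str', 'href', and 'anchorText'.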