How do I replace plain-text URLs in HTML markup while excluding URLs inside tags?

And*_*dri 13 html php regex url

I need your help.

I want to turn this:

sometext sometext http://www.somedomain.com/index.html sometext sometext

into this:

sometext sometext <a href="http://www.somedomain.com/index.html">www.somedomain.com/index.html</a> sometext sometext

I managed it with this regular expression:

preg_replace("#((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)#ie", "'<a href=\"$1\" target=\"_blank\">$1</a>$4'", $text);

The problem is that it also replaces URLs inside img tags, for example:

sometext sometext <img src="http://domain.com/image.jpg"> sometext sometext

becomes:

sometext sometext <img src="<a href="http://domain.com/image.jpg">domain.com/image.jpg</a>"> sometext sometext

Please help.

Gor*_*don 7

A streamlined version of Gumbo's answer:

$html = <<< HTML
<html>
<body>
<p>
    This is a text with a <a href="http://example.com/1">link</a>
    and another <a href="http://example.com/2">http://example.com/2</a>
    and also another http://example.com with the latter being the
    only one that should be replaced. There is also images in this
    text, like <img src="http://example.com/foo"/> but these should
    not be replaced either. In fact, only URLs in text that is no
    a descendant of an anchor element should be converted to a link.
</p>
</body>
</html>
HTML;

Let's use an XPath that fetches only text nodes containing http://, https://, or ftp:// that are not themselves descendants of an anchor element.

$dom = new DOMDocument;
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);
$texts = $xPath->query(
    '/html/body//text()[
        not(ancestor::a) and (
        contains(.,"http://") or
        contains(.,"https://") or
        contains(.,"ftp://") )]'
);

The XPath above gives us a text node with the following data:

 and also another http://example.com with the latter being the
    only one that should be replaced. There is also images in this
    text, like 

As of PHP 5.3, we could also call PHP functions from within the XPath and select the nodes with a regex pattern instead of three calls to contains.
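As a minimal sketch of that idea, the query below registers preg_match with DOMXPath and calls it through php:functionString. The sample HTML and the simplified scheme-only pattern are assumptions made for illustration; they are not taken from the answer.

```php
<?php
// Hypothetical sample document: one bare URL, one URL already inside a link.
$html = '<html><body><p>Visit http://example.com or '
      . '<a href="http://example.com/1">http://example.com/1</a> '
      . 'and plain text.</p><p>No link here.</p></body></html>';

$dom = new DOMDocument;
$dom->loadHTML($html);

$xPath = new DOMXPath($dom);
// The php: prefix must be bound to this namespace before PHP functions can be called.
$xPath->registerNamespace('php', 'http://php.net/xpath');
// Allow only preg_match to be called from XPath expressions.
$xPath->registerPhpFunctions('preg_match');

// php:functionString passes the context node (.) as a string; preg_match returns 1 on a match.
$texts = $xPath->query(
    '/html/body//text()[
        not(ancestor::a) and
        php:functionString("preg_match", "~(?:https?|ftp)://~i", .) = 1]'
);

foreach ($texts as $text) {
    echo trim($text->data), PHP_EOL;
}
```

Here the pattern only tests for a URL scheme, which mirrors the three contains calls; the full replacement pattern could be passed instead if the stricter selection is wanted.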

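Instead of splitting the text nodes in a standards-compliant way, we will use a document fragment and simply replace the entire text node with the fragment. Non-standard in this case only means that the method we use for this is not part of the W3C specification of the DOM API.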

foreach ($texts as $text) {
    $fragment = $dom->createDocumentFragment();
    $fragment->appendXML(
        preg_replace(
            "~((?:http|https|ftp)://(?:\S*?\.\S*?))(?=\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)~i",
            '<a href="$1">$1</a>',
            $text->data
        )
    );
    $text->parentNode->replaceChild($fragment, $text);
}
echo $dom->saveXML($dom->documentElement);

This then outputs:

<html><body>
<p>
    This is a text with a <a href="http://example.com/1">link</a>
    and another <a href="http://example.com/2">http://example.com/2</a>
    and also another <a href="http://example.com">http://example.com</a> with the latter being the
    only one that should be replaced. There is also images in this
    text, like <img src="http://example.com/foo"/> but these should
    not be replaced either. In fact, only URLs in text that is no
    a descendant of an anchor element should be converted to a link.
</p>
</body></html>


Gum*_*mbo 4

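You shouldn't use regular expressions for this, or at least not regular expressions alone. Use a proper HTML DOM parser instead, such as one of PHP's DOM libraries. Then you can iterate over the nodes, check whether each one is a text node, run the regular expression search, and replace the text node accordingly.

Something like this should do it: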

$pattern = "~((?:http|https|ftp)://(?:\S*?\.\S*?))(?=\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)~i";
$doc = new DOMDocument();
$doc->loadHTML($str);
// for every element in the document
foreach ($doc->getElementsByTagName('*') as $elem) {
    // for every child node in each element
    foreach ($elem->childNodes as $node) {
        if ($node->nodeType === XML_TEXT_NODE) {
            // split the text content to get an array of 1+2*n elements for n URLs in it
            $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
            $n = count($parts);
            if ($n > 1) {
                $parentNode = $node->parentNode;
                // insert for each pair of non-URL/URL parts one DOMText and DOMElement node before the original DOMText node
                for ($i=1; $i<$n; $i+=2) {
                    $a = $doc->createElement('a');
                    $a->setAttribute('href', $parts[$i]);
                    $a->setAttribute('target', '_blank');
                    $a->appendChild($doc->createTextNode($parts[$i]));
                    $parentNode->insertBefore($doc->createTextNode($parts[$i-1]), $node);
                    $parentNode->insertBefore($a, $node);
                }
                // insert the last part before the original DOMText node
                $parentNode->insertBefore($doc->createTextNode($parts[$i-1]), $node);
                // remove the original DOMText node
                $node->parentNode->removeChild($node);
            }
        }
    }
}

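OK, since the DOMNodeLists returned by getElementsByTagName and childNodes are live, every change in the DOM is reflected in those lists, so you cannot use foreach, because it would also iterate over the newly added nodes. Instead, you need to use for loops and keep track of the elements you add, so that you can adjust the index pointers and, ideally, the precalculated array boundaries accordingly.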
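The live behaviour can be seen in a small sketch like the following. The one-element document and the variable names are made up for illustration. It also shows that copying the list with iterator_to_array yields a static snapshot, which is one possible way to iterate safely, though not the approach this answer takes.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<div>a</div>');
$div = $doc->getElementsByTagName('div')->item(0);

$children = $div->childNodes;          // live DOMNodeList
$before = $children->length;           // one child: the text node "a"

$div->appendChild($doc->createElement('span'));
$after = $children->length;            // the same list object now reports the new child

// A plain PHP array copy does not follow later DOM changes.
$snapshot = iterator_to_array($div->childNodes);
$div->appendChild($doc->createElement('b'));

printf("before=%d after=%d live=%d snapshot=%d\n",
    $before, $after, $children->length, count($snapshot));
```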

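But since that is quite difficult in a somewhat complex algorithm like this one (you would need an index pointer and an array boundary for each of the three for loops), a recursive algorithm is more convenient: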

function mapOntoTextNodes(DOMNode $node, $callback) {
    if ($node->nodeType === XML_TEXT_NODE) {
        return $callback($node);
    }
    for ($i=0, $n=count($node->childNodes); $i<$n; ++$i) {
        $nodesChanged = 0;
        switch ($node->childNodes->item($i)->nodeType) {
            case XML_ELEMENT_NODE:
                $nodesChanged = mapOntoTextNodes($node->childNodes->item($i), $callback);
                break;
            case XML_TEXT_NODE:
                $nodesChanged = $callback($node->childNodes->item($i));
                break;
        }
        if ($nodesChanged !== 0) {
            // skip the nodes the callback inserted and extend the loop boundary
            $n += $nodesChanged;
            $i += $nodesChanged;
        }
    }
    // changes inside a child element do not alter the caller's own child count
    return 0;
}
function foo(DOMText $node) {
    $pattern = "~((?:http|https|ftp)://(?:\S*?\.\S*?))(?=\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)~i";
    $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
    $n = count($parts);
    if ($n > 1) {
        $parentNode = $node->parentNode;
        $doc = $node->ownerDocument;
        for ($i=1; $i<$n; $i+=2) {
            $a = $doc->createElement('a');
            $a->setAttribute('href', $parts[$i]);
            $a->setAttribute('target', '_blank');
            $a->appendChild($doc->createTextNode($parts[$i]));
            $parentNode->insertBefore($doc->createTextNode($parts[$i-1]), $node);
            $parentNode->insertBefore($a, $node);
        }
        $parentNode->insertBefore($doc->createTextNode($parts[$i-1]), $node);
        $parentNode->removeChild($node);
    }
    // one node was replaced by $n nodes, so the sibling count grew by $n-1
    return $n-1;
}

$str = '<div>sometext http://www.somedomain.com/index.html sometext <img src="http://domain.com/image.jpg"> sometext sometext</div>';
$doc = new DOMDocument();
$doc->loadHTML($str);
$elems = $doc->getElementsByTagName('body');
mapOntoTextNodes($elems->item(0), 'foo');

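Here mapOntoTextNodes is used to map a given callback function onto every DOMText node in a DOM document. You can pass either the whole DOMDocument node or just a specific DOMNode (in this case only the BODY node).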

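The function foo is then used to find and replace the plain URLs in the DOMText node's content. It splits the content string into non-URL/URL parts using preg_split while capturing the delimiter, which yields an array of 1+2·n items. The non-URL parts are then replaced by new DOMText nodes and the URL parts by new A elements, all of which are inserted before the original DOMText node, which is removed at the end. Since mapOntoTextNodes walks the tree recursively, it suffices to call that function on a specific DOMNode.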
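To make the 1+2·n layout concrete, here is a small sketch with two URLs. The pattern is a simplified stand-in for the one in the answer, and the input string is invented for illustration. Even indexes hold the non-URL text and odd indexes hold the captured URLs.

```php
<?php
// Simplified stand-in pattern (assumption): a scheme followed by non-whitespace characters.
$pattern = '~((?:https?|ftp)://\S+)~i';
$text = 'see http://a.example/x and ftp://b.example/y end';

// PREG_SPLIT_DELIM_CAPTURE keeps the captured delimiter (the URL) in the result.
$parts = preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE);

foreach ($parts as $i => $part) {
    $kind = ($i % 2 === 1) ? 'URL ' : 'text';
    printf("%d %s: [%s]\n", $i, $kind, $part);
}
```

With n = 2 URLs the array has 5 items, which is why foo reads $parts[$i] as the URL and $parts[$i-1] as the text before it.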


  • @Andri: But using regular expressions can produce unexpected results, because HTML is not a regular language. (6 upvotes)