And*_*dri 13 html php regex url
I need your help.
I want to turn this:
sometext sometext http://www.somedomain.com/index.html sometext sometext
into:
sometext sometext <a href="http://www.somedomain.com/index.html">www.somedomain.com/index.html</a> sometext sometext
I managed it with this regular expression:
preg_replace("#((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)#ie", "'<a href=\"$1\" target=\"_blank\">$1</a>$4'", $text);
The problem is that it also replaces img URLs, for example:
sometext sometext <img src="http://domain.com/image.jpg"> sometext sometext
which becomes:
sometext sometext <img src="<a href="http://domain.com/image.jpg">domain.com/image.jpg</a>"> sometext sometext
Please help.
A streamlined version of Gumbo's answer:
$html = <<< HTML
<html>
<body>
<p>
This is a text with a <a href="http://example.com/1">link</a>
and another <a href="http://example.com/2">http://example.com/2</a>
and also another http://example.com with the latter being the
only one that should be replaced. There is also images in this
text, like <img src="http://example.com/foo"/> but these should
not be replaced either. In fact, only URLs in text that is no
a descendant of an anchor element should be converted to a link.
</p>
</body>
</html>
HTML;
Let's use an XPath that fetches only the text nodes that contain http://, https://, or ftp:// and that are not descendants of an anchor element.
$dom = new DOMDocument;
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);
$texts = $xPath->query(
'/html/body//text()[
not(ancestor::a) and (
contains(.,"http://") or
contains(.,"https://") or
contains(.,"ftp://") )]'
);
The XPath above gives us a text node with the following data:
and also another http://example.com with the latter being the
only one that should be replaced. There is also images in this
text, like
As of PHP 5.3, we could also use PHP functions inside the XPath and select the nodes with a regex pattern instead of the three calls to contains.
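A minimal sketch of that regex-based selection, assuming the php XPath namespace is registered and preg_match is whitelisted through DOMXPath::registerPhpFunctions. The shortened $html sample and the scheme-only pattern are illustrative and not part of the original answer:

```php
<?php
// Illustrative input (hypothetical, shortened from the sample above).
$html = '<html><body><p>A <a href="http://example.com/1">link</a>, '
      . 'a plain http://example.com URL and '
      . '<img src="http://example.com/foo"/> an image.</p></body></html>';

$dom = new DOMDocument;
$dom->loadHTML($html);

$xPath = new DOMXPath($dom);
// Make PHP functions callable from XPath under the php: prefix.
$xPath->registerNamespace('php', 'http://php.net/xpath');
// Restrict the callable functions to preg_match only.
$xPath->registerPhpFunctions('preg_match');

// One regex test replaces the three contains() calls.
$texts = $xPath->query(
    '/html/body//text()[
        not(ancestor::a) and
        php:functionString("preg_match", "~(?:https?|ftp)://~", .) = 1
    ]'
);

foreach ($texts as $text) {
    echo $text->data, PHP_EOL;
}
```

The explicit comparison with 1 keeps the predicate boolean; a bare numeric result inside an XPath predicate would be read as a position test.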
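Instead of splitting the text nodes apart in the standards-compliant way, we will use a document fragment and simply replace the entire text node with the fragment. Non-standard in this case only means that the method we will be using is not part of the W3C specification of the DOM API.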
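foreach ($texts as $text) {
    $fragment = $dom->createDocumentFragment();
    // appendXML() parses its argument as XML, so the text must still be
    // well-formed XML once the links have been inserted
    $fragment->appendXML(
        preg_replace(
            "~((?:http|https|ftp)://(?:\S*?\.\S*?))(?=\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)~i",
            '<a href="$1">$1</a>',
            $text->data
        )
    );
    // swap the whole text node for the fragment
    $text->parentNode->replaceChild($fragment, $text);
}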
echo $dom->saveXML($dom->documentElement);
This outputs:
<html><body>
<p>
This is a text with a <a href="http://example.com/1">link</a>
and another <a href="http://example.com/2">http://example.com/2</a>
and also another <a href="http://example.com">http://example.com</a> with the latter being the
only one that should be replaced. There is also images in this
text, like <img src="http://example.com/foo"/> but these should
not be replaced either. In fact, only URLs in text that is no
a descendant of an anchor element should be converted to a link.
</p>
</body></html>
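You shouldn't do that with regular expressions, at least not with regular expressions only. Use a proper HTML DOM parser instead, such as one of PHP's DOM libraries. Then you can iterate over the nodes, check whether each one is a text node, run the regular expression search, and replace the text node appropriately.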
Something like this should do it:
$pattern = "~((?:http|https|ftp)://(?:\S*?\.\S*?))(?=\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)~i";
$doc = new DOMDocument();
$doc->loadHTML($str);
// for every element in the document
foreach ($doc->getElementsByTagName('*') as $elem) {
    // for every child node in each element
    foreach ($elem->childNodes as $node) {
        if ($node->nodeType === XML_TEXT_NODE) {
            // split the text content to get an array of 1+2*n elements for n URLs in it
            $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
            $n = count($parts);
            if ($n > 1) {
                $parentNode = $node->parentNode;
                // insert for each pair of non-URL/URL parts one DOMText and DOMElement node before the original DOMText node
                for ($i=1; $i<$n; $i+=2) {
                    $a = $doc->createElement('a');
                    $a->setAttribute('href', $parts[$i]);
                    $a->setAttribute('target', '_blank');
                    $a->appendChild($doc->createTextNode($parts[$i]));
                    $parentNode->insertBefore($doc->createTextNode($parts[$i-1]), $node);
                    $parentNode->insertBefore($a, $node);
                }
                // insert the last part before the original DOMText node
                $parentNode->insertBefore($doc->createTextNode($parts[$i-1]), $node);
                // remove the original DOMText node
                $node->parentNode->removeChild($node);
            }
        }
    }
}
Okay, since the DOMNodeLists returned by getElementsByTagName and childNodes are live, every change in the DOM is reflected in those lists, so you cannot use foreach here: it would also iterate over the newly added nodes. Instead, you need to use for loops, keep track of the elements added so that the index pointers are increased accordingly, and ideally adjust precalculated array boundaries as well.
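To make the "live" behaviour concrete, here is a small sketch. The markup and the variable names are hypothetical, and the iterator_to_array snapshot is shown only for contrast; it is a different technique from the index-tracking for loops described above:

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<div><b>one</b></div>');
$div = $doc->getElementsByTagName('div')->item(0);

$live = $div->childNodes;              // live DOMNodeList, follows DOM changes
$snapshot = iterator_to_array($live);  // plain array, copied once

$lengthBefore = $live->length;
// Adding a child is immediately visible through the live list.
$div->appendChild($doc->createElement('i', 'two'));
$lengthAfter = $live->length;

printf(
    "live: %d -> %d, snapshot: %d\n",
    $lengthBefore,
    $lengthAfter,
    count($snapshot)
);
```

The live list grows with the DOM while the array copy keeps its original size, which is why a loop that inserts nodes into the list it is walking has to adjust its own index and bound.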
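Tracking the indexes by hand is fairly difficult in an algorithm of this complexity (each of the three for loops would need its own index pointer and array boundary), so a recursive algorithm is more convenient: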
function mapOntoTextNodes(DOMNode $node, $callback) {
    if ($node->nodeType === XML_TEXT_NODE) {
        return $callback($node);
    }
    // childNodes is live, so read its length once and adjust it by hand below
    for ($i=0, $n=$node->childNodes->length; $i<$n; ++$i) {
        $nodesChanged = 0;
        switch ($node->childNodes->item($i)->nodeType) {
            case XML_ELEMENT_NODE:
                $nodesChanged = mapOntoTextNodes($node->childNodes->item($i), $callback);
                break;
            case XML_TEXT_NODE:
                $nodesChanged = $callback($node->childNodes->item($i));
                break;
        }
        if ($nodesChanged !== 0) {
            // skip the nodes that were just inserted
            $n += $nodesChanged;
            $i += $nodesChanged;
        }
    }
    // changes inside this element do not alter the parent's child count
    return 0;
}
function foo(DOMText $node) {
    $pattern = "~((?:http|https|ftp)://(?:\S*?\.\S*?))(?=\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)~i";
    $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
    $n = count($parts);
    if ($n > 1) {
        $parentNode = $node->parentNode;
        $doc = $node->ownerDocument;
        for ($i=1; $i<$n; $i+=2) {
            $a = $doc->createElement('a');
            $a->setAttribute('href', $parts[$i]);
            $a->setAttribute('target', '_blank');
            $a->appendChild($doc->createTextNode($parts[$i]));
            $parentNode->insertBefore($doc->createTextNode($parts[$i-1]), $node);
            $parentNode->insertBefore($a, $node);
        }
        $parentNode->insertBefore($doc->createTextNode($parts[$i-1]), $node);
        $parentNode->removeChild($node);
    }
    // net number of nodes added in place of the original text node
    return $n-1;
}

$str = '<div>sometext http://www.somedomain.com/index.html sometext <img src="http://domain.com/image.jpg"> sometext sometext</div>';
$doc = new DOMDocument();
$doc->loadHTML($str);
$elems = $doc->getElementsByTagName('body');
mapOntoTextNodes($elems->item(0), 'foo');
Here mapOntoTextNodes is used to map a given callback function onto every DOMText node in a DOM document. You can pass either the whole DOMDocument node or just a specific DOMNode (in this case just the BODY node).
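The function foo is then used to find and replace the plain URLs in the content of a DOMText node. It splits the content string into non-URL/URL parts using preg_split while capturing the delimiter, which results in an array of 1+2·n items for n URLs. The non-URL parts are then replaced by new DOMText nodes and the URL parts by new A elements; these are inserted before the original DOMText node, which is removed at the end. Since mapOntoTextNodes walks the tree recursively, it is enough to call that function once on a specific DOMNode.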
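The 1+2·n layout is easiest to see on a small input. This sketch reuses the pattern from foo; the sample text with n = 2 URLs is made up for illustration:

```php
<?php
// Same pattern as in foo() above.
$pattern = "~((?:http|https|ftp)://(?:\S*?\.\S*?))(?=\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)~i";

// Hypothetical sample text containing two URLs.
$text = 'see http://example.com/a and ftp://example.org/b now';

// PREG_SPLIT_DELIM_CAPTURE keeps the captured URLs in the result,
// so text and URL parts alternate: even index = text, odd index = URL.
$parts = preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE);

foreach ($parts as $i => $part) {
    printf("%d %-4s %s\n", $i, $i % 2 ? 'URL' : 'text', var_export($part, true));
}
```

With two URLs the array has five items, and the text parts at the even indexes may be empty strings when a URL sits at the very start or end of the input. That is why foo always inserts a DOMText node for $parts[$i-1], even if it is empty.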