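I have a database of HTML content that contains text with links. Some of the link URLs contain a hash sign (#) and some do not.
I need to remove the links whose URL contains a hash sign and keep the ones without it.
Example:
Input: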
<a href="http://example.com/books/1">The Lord of the Rings</a>
<ul>
<li><a href="http://example.com/books/1#c1" >Chapter 1</a></li>
<li><a name="name before href" href="http://example.com/books/1#c2">Chapter 2</a></li>
<li><a href="http://example.com/books/1#c3" name="name after href">Chapter 3</a></li>
<li><a href="http://example.com/books/1#cN" target="_blank">Chapter N</a></li>
</ul>
<br><br>
<a href="http://example.com/books/2">Harry Potter</a>
<ul>
<li><a href="http://example.com/books/2#c1" target="_self">Chapter 1</a></li>
<li><a href="http://example.com/books/2#c2" name="some have name" title="some others have title" >Chapter 2</a></li>
<li><a href="http://example.com/books/2#c3">Chapter 3</a></li>
<li><a href="http://example.com/books/2#cN" >Chapter N</a></li>
</ul>
Desired output:
<a href="http://example.com/books/1">The Lord of the Rings</a>
<ul>
<li>Chapter 1</li>
<li>Chapter 2</li>
<li>Chapter 3</li>
<li>Chapter N</li>
</ul>
<br><br>
<a href="http://example.com/books/2">Harry Potter</a>
<ul>
<li>Chapter 1</li>
<li>Chapter 2</li>
<li>Chapter 3</li>
<li>Chapter N</li>
</ul>
I am trying this code, but it removes all of the links, and I want to keep the ones without a hash sign.
$content = preg_replace('#<a.*?>([^>]*)</a>#i', '$1', $content);
So currently I get this:
The Lord of the Rings
<ul>
<li>Chapter 1</li>
<li>Chapter 2</li>
<li>Chapter 3</li>
<li>Chapter N</li>
</ul>
<br><br>
Harry Potter
<ul>
<li>Chapter 1</li>
<li>Chapter 2</li>
<li>Chapter 3</li>
<li>Chapter N</li>
</ul>
More details: some links contain line breaks.
Example:
<a href="http://example.com">
new line</a>
or
<a href="http://example.com">new
line</a>
You should avoid regular expressions here and use DOMDocument and DOMXPath instead.
<?php
$dom = new DOMDocument();
$dom->loadHtml('
<a href="http://example.com/books/1">The Lord of the Rings</a>
<ul>
<li><a href="http://example.com/books/1#c1" >Chapter 1</a></li>
<li><a name="name before href" href="http://example.com/books/1#c2">Chapter 2</a></li>
<li><a href="http://example.com/books/1#c3" name="name after href">Chapter 3</a></li>
<li><a href="http://example.com/books/1#cN" target="_blank">Chapter N</a></li>
</ul>
<br><br>
<a href="http://example.com/books/1">Harry Potter</a>
<ul>
<li><a href="http://example.com/books/2#c1" target="_self">Chapter 1</a></li>
<li><a href="http://example.com/books/2#c2" name="some have name" title="some others have title" >Chapter 2</a></li>
<li><a href="http://example.com/books/2#c3">Chapter 3</a></li>
<li><a href="http://example.com/books/2#cN" >Chapter N</a></li>
</ul>
', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
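foreach ($xpath->query("//a") as $link) {
    $href = $link->getAttribute('href');
    // The href contains a #, so replace the link element with its text content.
    // Replacing only the <a> node (not the parent's whole value) keeps any siblings intact.
    if (strpos($href, '#') !== false) {
        $link->parentNode->replaceChild($dom->createTextNode($link->nodeValue), $link);
    }
}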
echo $dom->saveHTML();
Result:
<a href="http://example.com/books/1">The Lord of the Rings<ul>
<li>Chapter 1</li>
<li>Chapter 2</li>
<li>Chapter 3</li>
<li>Chapter N</li>
</ul><br><br><a href="http://example.com/books/1">Harry Potter</a><ul>
<li>Chapter 1</li>
<li>Chapter 2</li>
<li>Chapter 3</li>
<li>Chapter N</li>
</ul></a>
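Note that in the result above the first <a> ends up wrapping the rest of the markup. This seems to come from passing a fragment with several top-level nodes together with LIBXML_HTML_NOIMPLIED. A possible way around it, as a minimal sketch only: wrap the fragment in a temporary <div>, unwrap the matching links, and serialize the children of that wrapper. The function name stripHashLinks is hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper: removes <a> elements whose href contains "#",
// keeping their contents, and leaves all other links untouched.
function stripHashLinks(string $html): string
{
    $dom = new DOMDocument();

    // Silence parser warnings for imperfect markup, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    // Assumption: a temporary <div> gives the fragment a single root element.
    $dom->loadHTML(
        '<div>' . $html . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Select only the anchors whose href attribute contains a hash sign.
    foreach ($xpath->query('//a[contains(@href, "#")]') as $link) {
        // Move every child of the link (text, line breaks, nested tags)
        // in front of the link, then drop the now-empty <a> element.
        while ($link->firstChild !== null) {
            $link->parentNode->insertBefore($link->firstChild, $link);
        }
        $link->parentNode->removeChild($link);
    }

    // Serialize the children of the wrapper, not the wrapper itself.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}
```

Usage would then be something like $content = stripHashLinks($content); for each row. Because the contents of the link are moved rather than matched as text, link text that spans several lines should need no special handling. Keep in mind that loadHTML assumes ISO-8859-1 unless told otherwise, so UTF-8 content may need an encoding declaration, and that whitespace between tags may not be preserved exactly.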