I'm using DOMDocument to load HTML from a database, like this:
$doc = new DOMDocument();
@$doc->loadHTML($data);
$doc->encoding = 'utf-8';
$doc->saveHTML();
Then I get the body text like this:
$bodyNodes = $doc->getElementsByTagName("body");
$words = htmlspecialchars($bodyNodes->item(0)->textContent);
This gives me everything inside <body>, including the contents of things like <script> elements. How can I remove those and keep only the real text content?
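You have to visit every node and return its text. If a node contains other nodes, visit those as well.
This can be done with this basic recursive algorithm: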
extractNode:
    if node is a text node or a cdata node, return its text
    if it is an element node, a document node or a document fragment node:
        if it's a script node, return an empty string
        return a concatenation of the result of calling extractNode on all the child nodes
    for everything else return nothing
Implementation:
function extractText($node) {
    // Text and CDATA nodes carry the actual text.
    if (XML_TEXT_NODE === $node->nodeType || XML_CDATA_SECTION_NODE === $node->nodeType) {
        return $node->nodeValue;
    }
    // Container nodes: recurse into the children, skipping <script> entirely.
    if (XML_ELEMENT_NODE === $node->nodeType || XML_DOCUMENT_NODE === $node->nodeType || XML_DOCUMENT_FRAG_NODE === $node->nodeType) {
        if ('script' === strtolower($node->nodeName)) return '';
        $text = '';
        foreach ($node->childNodes as $childNode) {
            $text .= extractText($childNode);
        }
        return $text;
    }
    // Comments and every other node type contribute nothing.
    return '';
}
This returns the textContent of the given $node, ignoring script tags and comments.
$words = htmlspecialchars(extractText($bodyNodes->item(0)));
Try it here: http://codepad.org/CS3nMp7U
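If other non-visible elements such as <style> should be dropped too, the same recursion can take a list of tag names to skip. The following is only a sketch under that assumption: extractVisibleText() and its $skip parameter are hypothetical names, not part of the answer above, and the sample HTML is made up for illustration.

```php
<?php
// Hypothetical variant of extractText(): the skipped element names are configurable.
function extractVisibleText(DOMNode $node, array $skip = ['script', 'style']): string
{
    // Text and CDATA nodes carry the actual text.
    if ($node->nodeType === XML_TEXT_NODE || $node->nodeType === XML_CDATA_SECTION_NODE) {
        return (string) $node->nodeValue;
    }
    // Elements on the skip list contribute nothing, including their children.
    if ($node->nodeType === XML_ELEMENT_NODE
        && in_array(strtolower($node->nodeName), $skip, true)) {
        return '';
    }
    // Everything else: concatenate the text of the children (comments have none).
    $text = '';
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $child) {
            $text .= extractVisibleText($child, $skip);
        }
    }
    return $text;
}

// Made-up sample input, for illustration only.
$sample = '<html><body><p>Hello <b>world</b></p>'
        . '<script>var x = 1;</script>'
        . '<style>p { color: red; }</style>'
        . '<!-- note --><p>Bye</p></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parse warnings instead of using @
$doc->loadHTML($sample);
libxml_clear_errors();

$body  = $doc->getElementsByTagName('body')->item(0);
$words = htmlspecialchars(extractVisibleText($body));
echo $words, "\n";
```

Using libxml_use_internal_errors() instead of the @ operator keeps parser warnings out of the output while leaving other errors visible.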
You can use XPath.
Borrowing the HTML used in arnaud's example above:
$html = <<< HTML
<p>
test<span>foo<b>bar</b>
</p>
<script>
ignored
</script>
<!-- comment is ignored -->
<p>test</p>
HTML;
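You simply query all text nodes that are not descendants of a script tag and that do not evaluate to an empty string. You also set preserveWhiteSpace to false, so that whitespace used only for formatting is not taken into account.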
$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
$dom->loadHtml($html);
$xp = new DOMXPath($dom);
$nodes = $xp->query('/html/body//text()[
    not(ancestor::script) and
    not(normalize-space(.) = "")
]');
foreach ($nodes as $node) {
    var_dump($node->textContent);
}
This will output (demo):
string(10) "
test"
string(3) "foo"
string(3) "bar"
string(4) "test"
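To end up with a single string, as $words in the question, the matched text nodes can be collected and joined. This is a rough sketch rather than a definitive implementation: bodyTextViaXPath() is a hypothetical helper name, the extra not(ancestor::style) test is an added assumption, and the sample HTML is a simplified one-line version of the example above.

```php
<?php
// Hypothetical helper, not part of the original answer.
function bodyTextViaXPath(string $html): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parse warnings instead of emitting them
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xp = new DOMXPath($dom);
    // Non-empty text nodes under <body> that are not inside <script> or <style>.
    $nodes = $xp->query(
        '//body//text()[not(ancestor::script) and not(ancestor::style) and normalize-space(.) != ""]'
    );

    $parts = [];
    foreach ($nodes as $node) {
        // Collapse runs of whitespace inside each text node and trim the ends.
        $parts[] = trim(preg_replace('/\s+/', ' ', $node->nodeValue));
    }
    // Joining with a space is a design choice; it also separates adjacent inline elements.
    return implode(' ', $parts);
}

$html = '<html><body><p>test <span>foo</span><b>bar</b></p>'
      . '<script>ignored</script><!-- comment --><p>test</p></body></html>';

$words = htmlspecialchars(bodyTextViaXPath($html));
echo $words, "\n";   // prints "test foo bar test"
```

Because every text node is joined with a single space, text from adjacent inline elements ("foo" and "bar" here) ends up separated even though the source has no whitespace between them; join with an empty string instead if that is not wanted.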