I am currently using PHP's DOMXPath to get the content of all <p> elements in a web page:
<?php
...
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$paragraphs = $xpath->evaluate("/html/body//p");
foreach ($paragraphs as $paragraph) {
    echo $paragraph->textContent . "<br />";
}
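My problem is that the string produced by textContent does not respect the <br /> tags inside these <p> elements. Instead, it drops the line breaks and pushes together words that would normally sit on separate lines. For example:
Example HTML: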
<p>
Some happy talk goes here talking about our great product.<br />
We would love for you to buy it!
</p>
<p>
Random information and what not<br />
Isn't that cool?
</p>
Current output of the PHP above:
Some happy talk about our …
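One possible direction, as a minimal sketch rather than a definitive fix: instead of reading textContent directly, walk each paragraph's child nodes and emit a newline wherever a <br /> element appears. The helper names textWithBreaks and paragraphsWithBreaks are hypothetical and not part of the original code. The sketch also assumes that whitespace in the HTML source should collapse to single spaces, so that only <br /> produces a line break.

```php
<?php
// Hypothetical helper: concatenates the text under $node, turning each <br> into "\n".
function textWithBreaks(DOMNode $node): string
{
    $out = '';
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            // Collapse source whitespace (including newlines) to single spaces.
            $out .= preg_replace('/\s+/', ' ', $child->nodeValue);
        } elseif ($child instanceof DOMElement) {
            $out .= strtolower($child->nodeName) === 'br'
                ? "\n"
                : textWithBreaks($child); // recurse into inline elements such as <b> or <a>
        }
        // Comments and other node types are skipped.
    }
    return $out;
}

// Hypothetical helper: returns one string per <p>, with "\n" where <br> stood.
function paragraphsWithBreaks(string $html): array
{
    libxml_use_internal_errors(true); // keep parser warnings about sloppy HTML quiet
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $result = [];
    foreach ($xpath->query('/html/body//p') as $p) {
        // Trim each line so the space left by collapsed whitespace does not lead or trail.
        $lines = array_map('trim', explode("\n", textWithBreaks($p)));
        $result[] = trim(implode("\n", $lines));
    }
    return $result;
}

// Usage: nl2br() turns the preserved "\n" back into <br /> for HTML output.
$html = "<p>\nRandom information and what not<br />\nIsn't that cool?\n</p>";
foreach (paragraphsWithBreaks($html) as $text) {
    echo nl2br(htmlspecialchars($text)) . "<br />";
}
```

Walking the nodes leaves the parsed document unmodified, which may matter if the same DOMDocument is queried again later.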