DOMDocument and HTML entities

Question

raf*_*afa 5 php character-encoding domdocument

I am trying to parse some HTML that contains HTML entities, such as &#215; (the multiplication sign ×):

$str = '<a href="http://example.com/"> A &#215; B</a>';

$dom = new DOMDocument;
$dom->substituteEntities = false;
$dom->loadHTML($str);

$link = $dom->getElementsByTagName('a')->item(0);
$fullname = $link->nodeValue;
$href = $link->getAttribute('href');

echo "
fullname: $fullname \n
href: $href\n";

However, DOMDocument replaces the entity in the text, giving A × B (which shows up as garbled characters in my output).

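Is there any way to stop it from interpreting the & of the HTML entity and have it leave the entity alone? I tried setting substituteEntities to false, but that had no effect.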

Answer 1

Pet*_*hof 1

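Are you sure the & is being replaced with &amp;? If that were the case, you would see the exact entity as literal text, not the garbled output you are getting.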

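My guess is that the entity is being converted to the actual character (encoded as UTF-8), and that you are viewing the page with a latin1 charset, which misreads those UTF-8 bytes, hence the garbled output.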
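
One way to check this guess is to look at the raw bytes of the text node rather than at how a browser or terminal renders them. The snippet below is only a sketch: it assumes the mbstring extension is available, and the variable names $text and $misread are made up for illustration.

```php
<?php
// Same markup as in the question.
$str = '<a href="http://example.com/"> A &#215; B</a>';

$dom = new DOMDocument;
$dom->loadHTML($str);

$text = $dom->getElementsByTagName('a')->item(0)->nodeValue;

// Hex dump of the text node. The entity shows up as the two bytes c3 97,
// which is the UTF-8 encoding of U+00D7 (the multiplication sign).
echo bin2hex($text), "\n";

// Treat the same bytes as ISO-8859-1 and convert them to UTF-8: the single
// character becomes two, which is the kind of garbling a latin1 page shows.
$misread = mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
echo mb_strlen($text, 'UTF-8'), ' vs ', mb_strlen($misread, 'UTF-8'), "\n";
```

If the hex dump contains c397, the entity was decoded correctly and the problem is in how the output is being displayed, not in DOMDocument.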

If I run your example, my output is:

fullname:  A \xc3\x97 B \n\nhref: http://example.com/\n

When I view this as latin1/iso-8859-1, I see the output you describe. But when I set the charset to UTF-8, the output is fine.
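
As a rough sketch of how that might look in the script itself, there are two possible approaches: declare UTF-8 in the response header, or turn non-ASCII characters back into numeric entities so the output stays pure ASCII. The second option assumes the mbstring extension is installed; $map and $asEntity are illustrative names, not something from the question.

```php
<?php
$str = '<a href="http://example.com/"> A &#215; B</a>';

$dom = new DOMDocument;
$dom->loadHTML($str);
$fullname = $dom->getElementsByTagName('a')->item(0)->nodeValue;

// Option 1: tell the browser the response is UTF-8.
// This only works before any output has been sent.
if (PHP_SAPI !== 'cli' && !headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

// Option 2: re-encode every non-ASCII character as a numeric entity,
// so the output is plain ASCII regardless of the page charset.
// Map format: [first code point, last code point, offset, mask].
$map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$asEntity = mb_encode_numericentity($fullname, $map, 'UTF-8');

echo "fullname: $asEntity\n";
```

Note that mb_encode_numericentity does not escape &, < or >, so text destined for HTML would still need htmlspecialchars applied before the conversion.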
