PHP - html_entity_decode not decoding everything

Spa*_*Dog 1 html php parsing dom

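I am parsing an HTML page. At some point I get the text inside a div and print it using html_entity_decode.

The problem is that the page contains characters like this star ★, or other shapes such as ⬛︎, ◄, ◉, etc. I have checked that these characters are not encoded in the source page; they appear there exactly as you would normally see them.

The page uses charset="UTF-8".

So, when I use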

html_entity_decode($string, ENT_QUOTES, 'UTF-8');

the star, for example, is "decoded" to â˜

$string is obtained using

document.getElementById("id-of-div").innerText

I want to decode them correctly. How do I do this in PHP?

Note: I have already tried htmlspecialchars_decode($string, ENT_QUOTES); it produces the same result.

Mat*_*son 5

I tried to reproduce your problem with this simple PHP:

<?php
  // Make sure our client knows we're sending UTF-8
  header('Content-Type: text/plain; charset=utf-8');
  $string = "The page contains characters like this star ★ or others like shapes like ⬛︎, ◄, ◉, etc. Here are some entities: This is a &quot;test&quot;.";
  echo 'String: ' . $string . "\n";
  echo 'Decoded: ' . html_entity_decode($string, ENT_QUOTES, 'UTF-8');

As expected, the output is:

String: The page contains characters like this star ★ or others like shapes like ⬛︎, ◄, ◉, etc. Here are some entities: This is a &quot;test&quot;.
Decoded: The page contains characters like this star ★ or others like shapes like ⬛︎, ◄, ◉, etc. Here are some entities: This is a "test".

If I change the charset in the header to iso-8859-1, I see:

String: The page contains characters like this star â˜… or others like shapes like â¬›ï¸Ž, â—„, â—‰, etc. Here are some entities: <span>This is a &quot;test&quot;.
Decoded: The page contains characters like this star â˜… or others like shapes like â¬›ï¸Ž, â—„, â—‰, etc. Here are some entities: <span>This is a "test".

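So I'd say your problem is a display issue. The "interesting" characters are left completely untouched by html_entity_decode, as you'd expect. It's just that whatever code you have that displays them, or whatever you are using to view the output, is incorrectly treating them as iso-8859-1.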
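
One way to check this for yourself is to look at the bytes rather than at the rendered text. The following is only a minimal sketch: it assumes the mbstring extension is loaded and that the script file itself is saved as UTF-8, and it uses Windows-1252 as a stand-in for what browsers actually apply when they are told iso-8859-1. The variable names are made up for illustration.

```php
<?php
// Sketch only: assumes mbstring is available and this file is saved as UTF-8.
$string = "star ★ and a &quot;test&quot;";

$decoded = html_entity_decode($string, ENT_QUOTES, 'UTF-8');

// Only the entity changes; the star's bytes pass through untouched.
var_dump($decoded === "star ★ and a \"test\"");   // bool(true)
var_dump(bin2hex("★"));                            // string(6) "e29885"

// Simulate a viewer that reads those three UTF-8 bytes as Windows-1252:
// each byte is turned into its own character.
$misread = mb_convert_encoding("★", 'UTF-8', 'Windows-1252');
var_dump($misread);                                // string(7) "â˜…"
```

If the hex dump of your own $string shows e29885 for the star, the data is fine and the thing to fix is the charset declared by whatever displays it (for example the Content-Type header or the page's meta charset).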