Spa*_*Dog 1 html php parsing dom
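I am parsing an HTML page. At some point I get the text inside a div and print that text using html_entity_decode.
The problem is that the page contains characters like this star: ★
or other shapes such as ⬛︎, ◄, ◉, etc. I checked, and these characters are not encoded as entities in the source page; they appear there exactly as you normally see them.
The page uses charset="UTF-8".
So, when I use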
html_entity_decode($string, ENT_QUOTES, 'UTF-8');
the star, for example, is "decoded" as â˜
$string is obtained using
document.getElementById("id-of-div").innerText
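I want to decode them correctly. How can I do this in PHP?
Note: I have already tried htmlspecialchars_decode($string, ENT_QUOTES);
It produces the same result.
I tried to reproduce your problem with this simple PHP: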
<?php
// Make sure our client knows we're sending UTF-8
header('Content-Type: text/plain; charset=utf-8');
$string = "The page contains characters like this star ★ or others like shapes like ⬛︎, ◄, ◉, etc. Here are some entities: &lt;span&gt;This is a &quot;test&quot;.";
echo 'String: ' . $string . "\n";
echo 'Decoded: ' . html_entity_decode($string, ENT_QUOTES, 'UTF-8');
As expected, the output is:
String: The page contains characters like this star ★ or others like shapes like ⬛︎, ◄, ◉, etc. Here are some entities: &lt;span&gt;This is a &quot;test&quot;.
Decoded: The page contains characters like this star ★ or others like shapes like ⬛︎, ◄, ◉, etc. Here are some entities: <span>This is a "test".
If I change the charset in the header to iso-8859-1, I see:
String: The page contains characters like this star â˜… or others like shapes like â¬›ï¸Ž, â—„, â—‰, etc. Here are some entities: &lt;span&gt;This is a &quot;test&quot;.
Decoded: The page contains characters like this star â˜… or others like shapes like â¬›ï¸Ž, â—„, â—‰, etc. Here are some entities: <span>This is a "test".
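Changing only the declared charset does not change the bytes PHP sends; it changes how the viewer reads them. A minimal sketch of that effect, assuming the mbstring extension is available and assuming the viewer treats iso-8859-1 as Windows-1252 (as browsers commonly do). The variable names are illustrative only:

```php
<?php
// The star written as an escape so the example does not depend on how
// this file itself is saved. Its UTF-8 bytes are E2 98 85.
$star = "\u{2605}";

// There are no entities in the string, so decoding leaves it unchanged.
$decoded = html_entity_decode($star, ENT_QUOTES, 'UTF-8');
$unchanged = ($decoded === $star);

// Simulate a viewer that reads those three bytes as Windows-1252:
// each byte becomes its own character (E2 -> â, 98 -> ˜, 85 -> …).
$misread = mb_convert_encoding($decoded, 'UTF-8', 'Windows-1252');

var_dump($unchanged);
echo $misread, "\n";
```

The misread form matches the kind of garbling shown in the question, even though the string held by PHP is still the correct UTF-8 star.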
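So I would say your problem is a display problem. html_entity_decode leaves the "interesting" characters completely untouched, as you would expect. It is just that whatever code you have, or whatever you are using to view the output, is wrongly using iso-8859-1 to display them.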
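One way to check where the problem sits is to look at the raw bytes rather than at the rendered text. The sketch below is an illustration and not part of the original answer: `inspectString` is a hypothetical helper, and it assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper: reports whether $s is valid UTF-8 and returns
// its bytes in hex so they can be compared against the expected ones.
function inspectString(string $s): array
{
    return [
        'valid_utf8' => mb_check_encoding($s, 'UTF-8'),
        'hex'        => bin2hex($s),
    ];
}

// A star followed by an entity-encoded quoted word.
$input   = "\u{2605} &quot;test&quot;";
$decoded = html_entity_decode($input, ENT_QUOTES, 'UTF-8');
$info    = inspectString($decoded);

print_r($info);
```

If the hex for the star is `e29885`, the data is intact and the declared charset of whatever displays it is the thing to fix. If it is `c3a2cb9ce280a6` instead, the text was already mis-converted somewhere before PHP printed it.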