What is DOMDocument doing to my string?

Question

<?php
$dom = new DOMDocument('1.0', 'UTF-8');
$str = '<p>Hello®</p>';
var_dump(mb_detect_encoding($str));
$dom->loadHTML($str);
var_dump($dom->saveHTML());

Output:

string(5) "UTF-8"

string(158) "<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>Hello&Acirc;&reg;</p></body></html>
"

Why is my Unicode ® being converted to Â®? How do I stop it?

(Am I going crazy today?)

Answer 1

(anonymous) 5

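You can prepend an XML encoding declaration to the markup (and strip it out again afterwards). This works for me on everything that is not CentOS 5.x (Ubuntu, cPanel's PHP):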

<?php
$dom = new DOMDocument('1.0', 'UTF-8');
$str = '<p>Hello®</p>';
var_dump(mb_detect_encoding($str)); 
$dom->loadHTML('<?xml encoding="utf-8">'.$str);
var_dump($dom->saveHTML());

This is what you get:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<?xml encoding="utf-8"><html><body><p>Hello&reg;</p></body></html>

Where the workaround does not take effect, you get this instead:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<?xml encoding="utf-8"><html><body><p>Hello&Acirc;&reg;</p></body></html>

Answer 2

Ign*_*ams 2

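Your text editor saves ® as the UTF-8 bytes "\xc2\xae", but read as Latin-1 (or a similar single-byte encoding) those same two bytes are the two characters Â®, and Latin-1 is what PHP is using to read them. Written back out as UTF-8, those two characters become the bytes "\xc3\x82\xc2\xae". Using a character entity reference (such as &reg;) in the input removes the ambiguity.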

>>> print u'®'.encode('utf-8').decode('latin-1')
Â®

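
The session above is Python 2. As a rough Python 3 sketch of the same mix-up, and of the character-reference idea from this answer, something like the following may help. The helper names `misread_as_latin1` and `to_char_refs` are made up for illustration and are not part of PHP or DOMDocument; the sketch only models the byte-level confusion, not what the HTML parser does internally.

```python
import html


def misread_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then decode those bytes as Latin-1 (the mix-up)."""
    return text.encode("utf-8").decode("latin-1")


def to_char_refs(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


original = "<p>Hello®</p>"

# The bytes the editor writes to disk for the UTF-8 source file.
print(original.encode("utf-8"))

# What a Latin-1 reader sees: one extra character in front of the sign.
garbled = misread_as_latin1(original)
print(len(original), len(garbled))

# Writing the misread text back out as UTF-8 gives the doubled-up bytes.
print(garbled.encode("utf-8"))

# Pure-ASCII markup reads the same under both encodings.
safe = to_char_refs(original)
print(safe)
print(misread_as_latin1(safe) == safe)
print(html.unescape(safe) == original)
```

Because the converted markup contains only ASCII bytes, it decodes identically as UTF-8 or Latin-1, which is why an entity reference sidesteps the problem regardless of which encoding the reader assumes.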

Archived: 14 years, 8 months ago
Views: 1584
Last activity: 13 years, 5 months ago