Decoding &#55357; into the real character

Son*_*ute 6 xml unicode

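I read data from Twitter's Streaming API and then write it to an XML file.

However, some special characters cause errors (I mean that when I open the XML file in Chrome, Chrome reports that the character is invalid).

I want to convert the encoded sequences (&#55357;&#56834;) into the real character (😂) before writing to the XML file.

How can I do this?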

- - - - - - - Added - - - - - - -

Here is the content of the XML file:

<?xml version="1.0" encoding="UTF-8"?>
<root>
<text>@carlyraejepsen would be a dream if you follow me, please follow me?, I love you so much you're my inspiration</text>
<text>someone please bring me a caramel apple and a mocha from black cat. i'll love you forever</text>
<text>“@G_MartinFlyKick: Marry me Juliet.I love you and that's all I really know.”&#55357;&#56834;&#55357;&#56834;&#55357;&#56834;&#55357;&#56834;&#55357;&#56834;</text>
<text>"I need to see a picture of him cuz Im trying to imagine you guys making love and all I see is u climbing on top of a big question mark"lmao</text>
<text>@District3music hi, I LOVE YOU follow me please? &amp;lt;3 xx 23</text>
<text>RT @syardley_: So appreciative of my family and people I love, wouldn't be where I am without them. #thankful</text>
<text>#DISTRICT3HALLOWEENFOLLOWSPREE #DISTRICT3HALLOWEENFOLLOWSPREE #3EEKERFROMTHENETHERLANDS love you! Please follow ? @District3music x42</text>
<text>Arguably my favorite electronic music producer @Kluteuk is coming back to Toronto on Dec 22nd. So stoked. Guy has made so many tunes I LOVE.</text>
<text>The stakes are high, the water's rough, but this love is ours.</text>
<text>@NiallOfficial Answer me, I love you very much. Venezuela loves. jhgj</text>
<text>Love this shit http://t.co/qSP79NKx</text>
</root>

Here is the error from Chrome:

This page contains the following errors:

error on line 5 at column 91: xmlParseCharRef: invalid xmlChar value 55357
Below is a rendering of the page up to the first error.

Juk*_*ela 15

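The character reference &#55357; denotes a surrogate code point (U+D83D), so trying to convert it into a character is an error. It is not a character, and not even half of a character.

You need to trace back to where the reference was generated. The cause is probably a character encoding mix-up. In UTF-16, surrogate code units may occur, but they must be processed in pairs whenever the data is interpreted as characters, for example when converting to another encoding or to character references.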
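To illustrate what "processed in pairs" means, here is a minimal Python sketch of the standard UTF-16 arithmetic that combines a high and a low surrogate code unit into one code point. The function name `combine_surrogates` is hypothetical, and the sample values are the two numbers from the XML file in the question.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one Unicode code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The two references from the question: &#55357; and &#56834;
code_point = combine_surrogates(55357, 56834)
print(f"U+{code_point:X}")  # U+1F602
print(f"&#{code_point};")   # &#128514;
```

A single reference to the combined code point (&#128514;) is valid XML, whereas two separate references to the surrogates are not.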

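  • Judging from the XML file content, the data seems to contain characters such as U+1F602 "😂", which occupies two code units in UTF-16. Apparently the original data is UTF-16 and should be converted to UTF-8 first. (2 upvotes)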
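The real fix is in whatever generates the references, but if a file with surrogate references already exists, it could be repaired after the fact. The following is only a sketch under that assumption: the name `repair_surrogate_refs` is hypothetical, it handles decimal references only, and it relies on Python's `surrogatepass` error handler to merge adjacent surrogate code units through a UTF-16 round trip.

```python
import re

# Decimal numeric character references such as &#55357;
_NUMERIC_REF = re.compile(r"&#(\d+);")


def _surrogate_ref_to_code_unit(match):
    """Turn a reference into a raw surrogate only if it is in the surrogate range."""
    value = int(match.group(1))
    if 0xD800 <= value <= 0xDFFF:
        return chr(value)
    # Leave ordinary references (for example &#65;) untouched.
    return match.group(0)


def repair_surrogate_refs(xml_text: str) -> str:
    """Replace surrogate-pair references with the character they encode.

    Raises UnicodeDecodeError if a surrogate reference has no partner.
    """
    with_code_units = _NUMERIC_REF.sub(_surrogate_ref_to_code_unit, xml_text)
    # The UTF-16 round trip joins each high/low pair into one code point.
    return with_code_units.encode("utf-16", "surrogatepass").decode("utf-16")


broken = "<text>Marry me Juliet.&#55357;&#56834;&#55357;&#56834;</text>"
repaired = repair_surrogate_refs(broken)
assert repaired == "<text>Marry me Juliet.\U0001F602\U0001F602</text>"
```

The repaired text can then be written to the file as UTF-8, matching the encoding declared in the XML prolog.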