Nokogiri Cannot Output XML with a UTF-16 Declaration (Understanding and Workaround)

Phr*_*ogz 6 ruby xml character-encoding libxml2 nokogiri

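Summary

Attempting to read and serialize an XML document that has both a UTF-16 encoding and a UTF-16 declaration causes Nokogiri to produce garbage after a certain point.

  1. Is this a bug, or is there a reasonable explanation?
  2. What is the best way to avoid it?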

Environment

C:\>nokogiri -v
# Nokogiri (1.5.5)
    ---
    warnings: []
    nokogiri: 1.5.5
    ruby:
      version: 1.9.3
      platform: i386-mingw32
      description: ruby 1.9.3p194 (2012-04-20) [i386-mingw32]
      engine: ruby
    libxml:
      binding: extension
      compiled: 2.7.7
      loaded: 2.7.7

Details

I have an XML file encoded in UTF-16 (LE) that also includes an XML declaration at the top stating that the encoding is UTF-16. Abbreviated, it looks like this:

<?xml version="1.0" encoding="UTF-16" ?>
<Foo>
  <Bar><![CDATA[
Lorem ipsum dolor ...about 3900 more bytes of content here...
  ]]></Bar>
  <Jim>Oh! Hello there.</Jim>
</Foo>

When I read this with Nokogiri, everything seems fine:

require 'nokogiri'
xml = File.open('Simplified.xml','rb:utf-16le',&:read)
p xml.encoding                        # #<Encoding:UTF-16LE>
p xml.valid_encoding?                 # true
doc1 = Nokogiri.XML(xml,&:noblanks)
xml1 = doc1.to_xml.encode('utf-8')
p xml1.encoding                       # #<Encoding:UTF-8>
p xml1.valid_encoding?                # true

However, the output of the serialized document becomes munged after a certain point:

p xml1  # Correct contents of CDATA removed from the following output
#=> "<?xml version=\"1.0\" encoding=\"UTF-16\"?>\n<Foo>\n  <Bar><![CDATA[\n...\n\t]]></Bar>\n  <Jim>Oh! Hello there.\uFFFE\u3C00\u0000\u2F00\u0000\u4A00\u0000\u6900\u0000\u6D00\u0000\u3E00\u0000\u0A00\u0000\u3C00\u0000\u2F00\u0000\u4600\u0000\u6F00\u0000\u6F00\u0000\u3E00\u0000\u0A00\u0000"

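(The limit appears to be related to the number of characters. I can add and remove a few words in the Lorem ipsum text without any change, but trimming the text below a certain size suddenly fixes the output.)

The Nokogiri document itself is not broken, however. I can successfully serialize <Jim> on its own: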

puts doc1.at('Jim').to_xml.encode('utf-8')
#=> <Jim>Oh! Hello there.</Jim>

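The only workaround I have found is to remove the XML declaration from the top of the document before parsing. With that, everything works as desired: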

decl = '<?xml version="1.0" encoding="UTF-16" ?>'.encode('UTF-16LE')
doc2 = Nokogiri.XML(xml.sub(decl,''),&:noblanks)
puts doc2.to_xml.encode('utf-8')
#=> <?xml version="1.0"?>
#=> <Foo>
#=>   <Bar><![CDATA[
#=> Lorem ipsum dolor...and more...
#=>   ]]></Bar>
#=>   <Jim>Oh! Hello there.</Jim>
#=> </Foo>
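The workaround above matches one exact spelling of the declaration. As a rough sketch of a more tolerant variant, the string could first be transcoded to UTF-8 and the declaration removed with a regular expression. The helper name `strip_xml_declaration` is hypothetical (it is not part of Nokogiri), and the sample document is a shortened stand-in for the real file. The assumption is that the resulting UTF-8 string would then be handed to Nokogiri.XML in place of the original.

```ruby
# Hypothetical helper (not a Nokogiri API): transcode UTF-16LE input to UTF-8
# and drop an optional BOM plus the leading XML declaration, if present.
def strip_xml_declaration(utf16le_xml)
  utf8 = utf16le_xml.encode('UTF-8')
  utf8.sub(/\A\uFEFF?\s*<\?xml\b[^?]*\?>\s*/, '')
end

# Shortened stand-in for the contents of Simplified.xml.
sample = %Q{<?xml version="1.0" encoding="UTF-16" ?>\n<Foo><Jim>Oh! Hello there.</Jim></Foo>}
sample = sample.encode('UTF-16LE')

cleaned = strip_xml_declaration(sample)
p cleaned           #=> "<Foo><Jim>Oh! Hello there.</Jim></Foo>"
p cleaned.encoding  #=> #<Encoding:UTF-8>
```

Because the declaration is gone and the string is already UTF-8, the parser no longer sees any claim that the data is UTF-16.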

Full XML

Here is the complete file, so you can test this for yourself:

<?xml version="1.0" encoding="UTF-16" ?>
<Foo>
  <Bar><![CDATA[
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam ac augue arcu, eget laoreet lorem. Quisque ac augue velit. Integer consectetur suscipit vehicula. Etiam et convallis enim. Etiam varius massa sit amet lacus rhoncus varius in non ante. Sed dictum, metus eu bibendum ornare, ligula dui commodo urna, ut dignissim felis dolor eget nisl. Proin sit amet nisi nunc. Vestibulum a urna sed dui dignissim blandit nec vel enim. Vivamus tincidunt nulla id dui hendrerit hendrerit.
Aliquam neque orci, luctus sit amet fringilla eu, varius vitae diam. Suspendisse varius rutrum lorem eget malesuada. Sed dapibus dapibus nisl, in cursus ante lacinia non. Aenean id sagittis ipsum. Suspendisse elit nunc, porta sit amet blandit ut, laoreet sed est. Nunc eget sem vitae nisl elementum ullamcorper ut sit amet urna. Sed ligula quam, fringilla in facilisis tincidunt, vehicula in nisi. Maecenas a augue in augue semper scelerisque sit amet ut arcu.
Praesent hendrerit, enim in elementum ornare, lorem nisi euismod dolor, sit amet ornare mi sem sodales lacus. Fusce et tempor mauris. In non quam nisl, non consequat diam. Duis sit amet massa ultrices massa cursus iaculis. Nunc ullamcorper malesuada sem dignissim semper. Fusce aliquet lacus quis nisi tincidunt sodales. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque posuere commodo aliquet. Aliquam blandit vestibulum facilisis. Sed pellentesque viverra dignissim. Etiam est lacus, mollis eu pretium vitae, lacinia eleifend augue. Mauris vitae quam nisl. In venenatis nunc ac eros elementum cursus.
Sed a metus sit amet nunc euismod condimentum id non orci. Curabitur velit turpis, lacinia non eleifend sed, rhoncus id est. Fusce ut massa dolor, ut sodales odio. Donec aliquam convallis tellus, eu pharetra tortor iaculis non. Integer imperdiet feugiat ipsum a gravida. Mauris sapien ipsum, ultricies ac placerat ut, imperdiet eu justo. Quisque quis consectetur velit. Etiam facilisis sapien nec enim tincidunt pulvinar. Duis fermentum faucibus felis, sed consequat libero pretium at. Phasellus nibh purus, suscipit in vestibulum vel, blandit at leo. Suspendisse placerat elit sed enim bibendum vel hendrerit mauris pretium. Maecenas ut lacus eu nisi euismod pretium.
Aliquam feugiat felis id massa aliquam pharetra sed non eros. Morbi interdum molestie iaculis. Curabitur varius ante ac dui dapibus non laoreet risus blandit. Nunc sit amet magna lacus. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Phasellus egestas nunc sed turpis imperdiet a rhoncus massa aliquam. Nulla facilisi. Phasellus sit amet neque felis, nec vestibulum massa. Donec luctus fringilla dolor et gravida. Phasellus euismod lectus eget elit hendrerit non vehicula tellus venenatis. Phasellus sit amet ligula et purus dignissim feugiat at vitae libero. Proin ut tortor eros, quis laoreet lectus. Quisque nec urna mattis ante gravida fermentum eu at nibh. Phasellus sapien elit, tincidunt quis laoreet id, lobortis sed magna. Aliquam pulvinar erat eu sapien pretium bibendum. Maecenas eleifend, leo quis sodales tincidunt, leo felis tristique dolor, vitae ultrices neque felis ut metus.
Etiam dignissim egestas ipsum, eget tempor ipsum rutrum eu. Donec vehicula eleifend ullamcorper. Mauris justo nulla, varius a mattis a, cursus sit amet risus. Phasellus rutrum interdum blandit. Donec ut justo eros, ut auctor dolor. Suspendisse potenti. Cras ultricies, dui eget mattis bibendum, leo dui luctus purus, sit amet rhoncus libero metus eget purus. Pellentesque scelerisque ornare sapien faucibus tempor.
Suspendisse potenti. Proin fermentum bibendum dapibus. Pellentesque facilisis aliquam. Nam egestas tellus non mauris scelerisque feugiat pellentesque lacus dignissim. Quisque id nulla felis. Mauris justo mauris, posuere sed facilisis in, venenatis nec risus. Mauris eu dui sed tellus laoreet tempor a in turpis volutpat.
  ]]></Bar>
  <Jim>Oh! Hello there.</Jim>
</Foo>

mat*_*att 3

Rather than serializing the XML and then calling encode on the string, you can specify the encoding to use in the options to to_xml. Instead of:

xml1 = doc1.to_xml.encode('utf-8')

use:

xml1 = doc1.to_xml(:encoding => 'utf-8')

This appears to fix the problem.

As for what's going on, I can only offer some observations.

First, the encoding of the string produced by to_xml without specifying an encoding is UTF-16, which in Ruby is a "dummy encoding" (whatever that means):

xml1 = doc1.to_xml
p xml1.encoding
#=> #<Encoding:UTF-16 (dummy)>

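The docs have this to say about dummy encodings:

"A dummy encoding is an encoding for which character handling is not properly implemented. It is used for stateful encodings."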
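To make the distinction concrete, here is a small sketch using only core Ruby. It checks `Encoding#dummy?` for the generic UTF-16 encoding and for UTF-16LE, then decodes two bytes whose byte order is stated explicitly. The variable names are illustrative only.

```ruby
# Generic UTF-16 leaves the byte order to a BOM in the data, so Ruby
# treats it as a dummy encoding. UTF-16LE fixes the byte order itself.
generic  = Encoding::UTF_16
explicit = Encoding::UTF_16LE

p generic.dummy?   #=> true
p explicit.dummy?  #=> false

# With an explicit byte order, the bytes 3C 00 decode unambiguously.
bytes = "<\x00".b
decoded_char = bytes.force_encoding(explicit).encode('UTF-8')
p decoded_char     #=> "<"
```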

The other thing I noticed is that the values in the munged part of the output actually correspond to the code points that should be appearing.

3C is <, 2F is /, 4A is J, 69 is i, and so on, so the expected closing markup (</Jim> followed by </Foo>) is being produced, if you ignore the zeros and the extra BOM.
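This observation can be checked mechanically. The sketch below takes the munged tail exactly as it appears in the question's output, drops the stray U+FFFE and the U+0000 padding, and keeps the high byte of each remaining code point. The variable names are illustrative only.

```ruby
# Munged tail copied from the output shown in the question.
munged = "\uFFFE\u3C00\u0000\u2F00\u0000\u4A00\u0000\u6900\u0000\u6D00" \
         "\u0000\u3E00\u0000\u0A00\u0000\u3C00\u0000\u2F00\u0000\u4600" \
         "\u0000\u6F00\u0000\u6F00\u0000\u3E00\u0000\u0A00\u0000"

recovered = munged.codepoints
                  .reject { |cp| cp == 0xFFFE || cp.zero? }  # extra BOM and zero padding
                  .map { |cp| (cp >> 8).chr }                # high byte is the real character
                  .join

p recovered  #=> "</Jim>\n</Foo>\n"
```

The recovered text is the closing markup the document should have ended with, which suggests the data is intact and only the byte order and padding are wrong.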

If you write out the XML that Nokogiri produces (before encoding it to UTF-8) and point a hex editor at it, the output begins with FF FE, which is the little-endian BOM.

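When the munging starts, it looks like this:

Hello there.\uFFFE\u3C00\u0000\u2F00\u0000\u4A00\u0000\u6900...

fe ff is where the munged output starts (it shows up as \uFFFE above because those bytes are still being read as little-endian). fe ff is also the big-endian BOM, and the other characters appear to be big-endian as well (in a hex dump, the columns of zeros do not line up before and after the fe ff), although there are extra pairs of zero bytes between the characters.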
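The effect of a byte-order switch in mid-stream can be reproduced without Nokogiri. This is only a sketch of the symptom, not a reconstruction of what libxml2 actually emits: it joins a little-endian chunk and a big-endian chunk, each with its own BOM, labels the result as generic UTF-16, and transcodes it to UTF-8. It assumes Ruby's UTF-16 transcoder picks the byte order from the first BOM only, which matches the output shown in the question.

```ruby
le_part = "\uFEFFA".encode('UTF-16LE').b  # bytes FF FE 41 00
be_part = "\uFEFFB".encode('UTF-16BE').b  # bytes FE FF 00 42

# Label the concatenation as generic UTF-16, as to_xml did in the question.
stream  = (le_part + be_part).force_encoding('UTF-16')
decoded = stream.encode('UTF-8')

p decoded.codepoints.map { |cp| format('U+%04X', cp) }
#=> ["U+0041", "U+FFFE", "U+4200"]
```

The second BOM comes out as U+FFFE and "B" (00 42) comes out as U+4200, the same byte-swapped pattern seen after "Hello there." in the munged output.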