Python ElementTree won't convert non-breaking spaces when outputting UTF-8

Gre*_*son 10 python xml encoding elementtree

I am trying to parse, manipulate, and output HTML using Python's ElementTree:

import sys
from cStringIO  import StringIO
from xml.etree  import ElementTree as ET
from htmlentitydefs import entitydefs

source = StringIO("""<html>
<body>
<p>Less than &lt;</p>
<p>Non-breaking space &nbsp;</p>
</body>
</html>""")

parser = ET.XMLParser()
parser.parser.UseForeignDTD(True)
parser.entity.update(entitydefs)
etree = ET.ElementTree()

tree = etree.parse(source, parser=parser)
for p in tree.findall('.//p'):
    print ET.tostring(p, encoding='UTF-8')

When I run this with Python 2.7 on Mac OS X 10.6, I get:

<p>Less than &lt;</p>

Traceback (most recent call last):
  File "bar.py", line 20, in <module>
    print ET.tostring(p, encoding='utf-8')
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1120, in tostring
    ElementTree(element).write(file, encoding, method=method)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 815, in write
    serialize(write, self._root, encoding, qnames, namespaces)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 931, in _serialize_xml
    write(_escape_cdata(text, encoding))
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1067, in _escape_cdata
    return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 19: ordinal not in range(128)

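I thought that specifying "encoding='UTF-8'" would take care of the non-breaking space character, but apparently it doesn't. What should I do instead?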

lam*_*cck 7

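0xA0 is the latin1 encoding of the non-breaking space, not a unicode character, and the value of p.text in the loop is a str rather than a unicode object. That means that in order to encode it as utf-8, Python must first implicitly convert it to a unicode string (i.e. decode it). When it does this it assumes ascii, since it wasn't told anything else. 0xa0 is not a valid ascii byte, but it is a valid latin1 character.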
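The implicit decode step described above can be reproduced by hand. This is a minimal sketch, assuming Python 3, where the decode has to be written out explicitly; the names `raw`, `error`, `text` and `utf8` are only illustrative.

```python
# The byte string that p.text held under Python 2: latin1-encoded text.
raw = b"Non-breaking space \xa0"

# Python 2 performed this decode implicitly, with the ascii codec.
try:
    raw.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc  # byte 0xa0 is outside the ascii range

# Decoding with the codec the bytes were really written in succeeds,
# and the resulting text can then be encoded as UTF-8.
text = raw.decode("latin-1")
utf8 = text.encode("utf-8")
```

The failing offset in `error.start` matches the position reported in the traceback in the question, because the 0xa0 byte follows the 19 characters of "Non-breaking space ".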

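The reason you have a latin1 character instead of a unicode character is that entitydefs is a mapping from entity names to latin1-encoded strings. What you need is the unicode code point, which you can get from htmlentitydefs.name2codepoint.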

The following version should fix it for you:

import sys
from cStringIO  import StringIO
from xml.etree  import ElementTree as ET
from htmlentitydefs import name2codepoint

source = StringIO("""<html>
<body>
<p>Less than &lt;</p>
<p>Non-breaking space &nbsp;</p>
</body>
</html>""")

parser = ET.XMLParser()
parser.parser.UseForeignDTD(True)
parser.entity.update((x, unichr(i)) for x, i in name2codepoint.iteritems())
etree = ET.ElementTree()

tree = etree.parse(source, parser=parser)
for p in tree.findall('.//p'):
    print ET.tostring(p, encoding='UTF-8')
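The mapping that the fixed version feeds into parser.entity can also be inspected on its own. This is a minimal sketch, assuming Python 3, where the htmlentitydefs module is named html.entities and unichr is spelled chr; the names `entities`, `nbsp` and `encoded` are illustrative.

```python
from html.entities import name2codepoint

# Entity name -> one-character text string, built from the code points.
entities = {name: chr(codepoint) for name, codepoint in name2codepoint.items()}

nbsp = entities["nbsp"]          # U+00A0 as text, not as a latin1 byte
encoded = nbsp.encode("utf-8")   # text encodes to UTF-8 without any implicit decode
```

Because every value is text rather than an encoded byte string, serializing to UTF-8 never has to guess an input encoding.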


lav*_*nio 4

XML only defines &lt;, &gt;, &apos;, &quot; and &amp;. &nbsp; and the others come from HTML. So you have a couple of choices.

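  1. You can change your source to use numeric entities, such as &#160; or &#xA0;, both of which are equivalent to &nbsp;.
  2. You can use a DTD that defines those values.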
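Both choices can be sketched briefly. The example below assumes Python 3, and the one-line internal DTD subset is a hypothetical minimal declaration written for illustration, not a full HTML DTD.

```python
from xml.etree import ElementTree as ET

# Choice 1: numeric character references need no DTD at all.
numeric = ET.fromstring("<p>Non-breaking space &#160;&#xA0;</p>")

# Choice 2: declare the entity in an internal DTD subset
# (hypothetical minimal DTD covering only nbsp).
document = (
    '<!DOCTYPE p [<!ENTITY nbsp "&#160;">]>'
    "<p>Non-breaking space &nbsp;</p>"
)
named = ET.fromstring(document)

# Either way the element text holds U+00A0, which serializes cleanly as UTF-8.
utf8 = ET.tostring(named, encoding="utf-8")
```

In both cases the parser produces the character U+00A0 itself, so no entity table has to be supplied to the parser.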

There is some useful information in the XSLT FAQ (it is about XSLT, but XSLT is written in XML, so the same applies).


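The question now seems to include a stack trace; that changes things. Are you sure the string is in UTF-8? If it resolves to the single byte 0xA0, then it is not UTF-8, but more likely cp1252 or iso-8859-1.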
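One way to check that claim is to try decoding the bytes with each candidate codec. This is a minimal sketch, assuming Python 3; the helper `decodes_as` is a hypothetical name introduced only for this example.

```python
def decodes_as(data, encoding):
    """Return True if the byte string is valid in the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


single_byte = b"Non-breaking space \xa0"    # a lone 0xA0 byte
two_bytes = b"Non-breaking space \xc2\xa0"  # U+00A0 encoded as UTF-8

results = {
    "utf-8": decodes_as(single_byte, "utf-8"),
    "cp1252": decodes_as(single_byte, "cp1252"),
    "iso-8859-1": decodes_as(single_byte, "iso-8859-1"),
}
```

A successful decode only shows that the bytes are valid in that codec, not that the codec is the one the author intended, so treat the result as a hint.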