ElementTree Unicode encode/decode error

B8v*_*ede 3 python unicode elementtree python-2.7

For a project I need to enrich some XML and store it in a file. The problem is that I keep getting the following error:

Traceback (most recent call last):
  File "C:\Python27\lib\multiprocessing\process.py", line 258, in _bootstrap
    self.run()
  File "C:\Python27\lib\multiprocessing\process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\Bart\Dropbox\Studie\2013-2014\BSc-KI\cite_parser\parser.py", line 193, in parse_references
    outputXML = ET.tostring(root, encoding='utf8', method='xml')
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1126, in tostring
    ElementTree(element).write(file, encoding, method=method)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 820, in write
    serialize(write, self._root, encoding, qnames, namespaces)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
 ECLI:NL:RVS:2012:BY1564
 File "C:\Python27\lib\xml\etree\ElementTree.py", line 937, in _serialize_xml
    write(_escape_cdata(text, encoding))
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1073, in _escape_cdata
    return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 80: ordinal not in range(128)

The error is raised by this line:

outputXML = ET.tostring(root, encoding='utf8', method='xml')

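While looking for a solution I found suggestions to add .decode('utf-8') to the call, but that only moves the problem: the decoding error goes away and the write call then fails with an encoding error instead, so it doesn't work either...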

The encoding error:

Traceback (most recent call last):
  File "C:\Python27\lib\multiprocessing\process.py", line 258, in _bootstrap
    self.run()
  File "C:\Python27\lib\multiprocessing\process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\Bart\Dropbox\Studie\2013-2014\BSc-KI\cite_parser\parser.py", line 197, in parse_references
    myfile.write(outputXML)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xeb' in position 13559: ordinal not in range(128)

It is produced by the following code:

outputXML = ET.tostring(root, encoding='utf8', method='xml').decode('utf-8')

The source (or at least the relevant part):

# URL encodes the parameters
encoded_parameters = urllib.urlencode({'id':ecli})

# Opens XML file
feed = urllib2.urlopen("http://data.rechtspraak.nl/uitspraken/content?"+encoded_parameters, timeout = 3)

# Parses the XML
ecliFile = ET.parse(feed)

# Fetches root element of current tree
root = ecliFile.getroot()

# Write the XML to a file without any extra indents or newlines
outputXML = ET.tostring(root, encoding='utf8', method='xml')

# Write the XML to the file
with open(file, "w") as myfile:
    myfile.write(outputXML)

Last but not least, the URL of an example XML document: http://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:NVS:2012:BY1542

Mar*_*ers 6

The exception is caused by a byte string value.

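text in the traceback is supposed to be a unicode value. If it is a plain byte string instead, Python 2 will implicitly decode it to Unicode first (using the ASCII codec), just so that it can then encode it again.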

It is that implicit decoding step that fails.
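The two steps can be spelled out explicitly as a rough sketch. The function name encode_like_python2 and the sample word are made up for illustration; the function only mimics what Python 2 does behind the scenes when .encode() is called on a byte string, and is not part of ElementTree.

```python
# -*- coding: utf-8 -*-

# A byte string holding UTF-8 data; the e-diaeresis becomes the bytes 0xc3 0xab.
raw = u"Belgi\u00eb".encode("utf-8")


def encode_like_python2(byte_string, encoding="utf-8"):
    """Illustrative only: mimic Python 2's str.encode() on a byte string."""
    # Step 1: the implicit decode, which always uses the ASCII codec.
    as_unicode = byte_string.decode("ascii")
    # Step 2: the encode that was actually requested.
    return as_unicode.encode(encoding, "xmlcharrefreplace")


# Pure ASCII bytes survive the round trip unchanged.
plain = encode_like_python2(b"plain text")

# Bytes outside the ASCII range fail in step 1, before any encoding happens.
message = None
try:
    encode_like_python2(raw)
except UnicodeDecodeError as exc:
    message = str(exc)
```

This is why a call to encode() ends in a UnicodeDecodeError: the failure happens in the hidden decode step, not in the encode the code asked for.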

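Because you didn't actually show us what you insert into the XML tree, it is hard to tell you what to fix, other than to make sure you always use Unicode values when inserting text.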

Demo:

>>> root.attrib['oops'] = u'Data with non-ASCII codepoints \u2014 (em dash)'.encode('utf8')
>>> ET.tostring(root, encoding='utf8', method='xml')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/xml/etree/ElementTree.py", line 1126, in tostring
    ElementTree(element).write(file, encoding, method=method)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/xml/etree/ElementTree.py", line 820, in write
    serialize(write, self._root, encoding, qnames, namespaces)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/xml/etree/ElementTree.py", line 932, in _serialize_xml
    v = _escape_attrib(v, encoding)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/xml/etree/ElementTree.py", line 1090, in _escape_attrib
    return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 31: ordinal not in range(128)
>>> root.attrib['oops'] = u'Data with non-ASCII codepoints \u2014 (em dash)'
>>> ET.tostring(root, encoding='utf8', method='xml')
'<?xml version=\'1.0\' encoding=\'utf8\'?> ...'

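Setting an attribute to a byte string that contains bytes outside the ASCII range triggers the exception; using a unicode value ensures the result can be produced.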
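A minimal sketch of that approach, assuming the inserted values arrive as UTF-8 byte strings. The helper to_text, the element names and the output file name are hypothetical, since the question does not show the code that modifies the tree. Values are decoded once before they enter the tree, and the byte string returned by tostring() is written to a file opened in binary mode, so no implicit ASCII conversion takes place on either side.

```python
# -*- coding: utf-8 -*-
import os
import tempfile
import xml.etree.ElementTree as ET


def to_text(value, encoding="utf-8"):
    """Hypothetical helper: decode byte strings explicitly, pass text through."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


root = ET.Element("root")
raw = u"Data with non-ASCII codepoints \u2014 (em dash)".encode("utf-8")

# Decode before the value enters the tree, so the tree only holds text.
root.set("oops", to_text(raw))
ET.SubElement(root, "note").text = to_text(raw)

# With an explicit encoding, tostring() returns an encoded byte string.
output_xml = ET.tostring(root, encoding="utf-8", method="xml")

# Write the bytes unchanged; binary mode avoids any implicit re-encoding.
path = os.path.join(tempfile.mkdtemp(), "output.xml")
with open(path, "wb") as handle:
    handle.write(output_xml)
```

Calling .decode('utf-8') on the result of tostring() is not needed with this layout: the output is already encoded, and decoding it only forces the file object to encode it again with its own default codec.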