Writing an XML UTF-8 file with UTF-8 data using ElementTree

c0m*_*0m4 11 python elementtree

I'm trying to write an XML file with UTF-8 encoded data using ElementTree, like this:

#!/usr/bin/python                                                                       
# -*- coding: utf-8 -*-                                                                   

import xml.etree.ElementTree as ET
import codecs

testtag = ET.Element('unicodetag')
testtag.text = u'Töreboda'  # The o is really ö (o with two dots over it). No idea why SO doesn't display this
expfile = codecs.open('testunicode.xml',"w","utf-8-sig")
ET.ElementTree(testtag).write(expfile,encoding="UTF-8",xml_declaration=True)
expfile.close()

This results in the following error:

Traceback (most recent call last):
  File "unicodetest.py", line 10, in <module>
    ET.ElementTree(testtag).write(expfile,encoding="UTF-8",xml_declaration=True)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 815, in write
    serialize(write, self._root, encoding, qnames, namespaces)    
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 932, in _serialize_xml
    write(_escape_cdata(text, encoding))
  File "/usr/lib/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/usr/lib/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

Using the "us-ascii" encoding instead works, but it doesn't preserve the Unicode characters in the data. What is happening?

Mar*_*nen 19

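codecs.open expects Unicode strings to be written to the file object, and it handles the encoding to UTF-8 itself. ElementTree's write method, however, encodes the Unicode strings to UTF-8 byte strings before sending them to the file object. Since the file object wants Unicode strings, Python 2 coerces each byte string back to Unicode using the default ascii codec, and that coercion raises the UnicodeDecodeError as soon as it hits a non-ASCII byte (here 0xc3, the first byte of the UTF-8 encoding of ö).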
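The failing step can be reproduced by hand. The snippet below is only a sketch: it performs explicitly the ASCII decode that, as described above, Python 2 attempts implicitly inside the codecs writer. The variable names are illustrative and not taken from ElementTree or codecs.

```python
# -*- coding: utf-8 -*-
# Sketch: do explicitly what Python 2 does implicitly inside the codecs writer.

text = u'T\xf6reboda'            # the Unicode string from the question
encoded = text.encode('utf-8')   # the UTF-8 byte string ElementTree hands to write()

try:
    # A writer that expects Unicode first has to turn the bytes back into
    # Unicode; Python 2 uses the default 'ascii' codec for that coercion.
    encoded.decode('ascii')
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

The printed message should name the same byte (0xc3) and position (1) as the traceback in the question.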

Do this instead:

#expfile = codecs.open('testunicode.xml',"w","utf-8-sig")
ET.ElementTree(testtag).write('testunicode.xml',encoding="UTF-8",xml_declaration=True)
#expfile.close()
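If a file object is needed rather than a file name, one option that should be equivalent is to open the file in binary mode, so that the file accepts the UTF-8 bytes ElementTree produces and no second encoding pass takes place. This is a minimal sketch, not part of the original answer; the temporary directory is an assumption made only to keep the example self-contained, and unlike the "utf-8-sig" codec in the question it does not write a BOM.

```python
# -*- coding: utf-8 -*-
import os
import tempfile
import xml.etree.ElementTree as ET

testtag = ET.Element('unicodetag')
testtag.text = u'T\xf6reboda'

# Hypothetical output location, chosen only so the sketch runs anywhere.
path = os.path.join(tempfile.mkdtemp(), 'testunicode.xml')

# Binary mode: the file takes bytes, and ElementTree does the UTF-8 encoding once.
with open(path, 'wb') as expfile:
    ET.ElementTree(testtag).write(expfile, encoding='UTF-8', xml_declaration=True)

# Read the raw bytes back and parse the file to check the round trip.
with open(path, 'rb') as f:
    data = f.read()
roundtrip = ET.parse(path).getroot().text

print(data.decode('utf-8'))
```

Parsing the file back gives the original Unicode text, which suggests the data was encoded exactly once.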

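  • +1. Just to make this clear: the problem is that you are trying to encode unicode -> utf-8 twice. ElementTree does it once, and then the codec-enabled stream tries to do it again. The second pass gets confused because its input is already encoded (it expects a unicode string but gets a utf-8 encoded byte string). (2 upvotes)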