Unicode error when passing a unicode object to an XML parser

jlc*_*lin 2 python unicode xml-parsing

I am trying to read a gzip file that contains XML with Unicode characters, but I get an error. The code I am using is:

import gzip
import xml.dom.minidom

path = "index.mjml.gz"
gzFile = gzip.open(path, mode='r')
gzContents = gzFile.read()
gzFile.close()

unicodeContents = gzContents.encode('utf-8')
xmlContent = xml.dom.minidom.parseString(unicodeContents)
# Do stuff with xmlContent

When I run this code, I get the following error (it fails on the line that begins with xmlContent):

/Library/Frameworks/EPD64.framework/Versions/7.1/lib/python2.7/xml/dom/minidom.pyc in parseString(string, parser)
   1922     if parser is None:
   1923         from xml.dom import expatbuilder
-> 1924         return expatbuilder.parseString(string)
   1925     else:
   1926         from xml.dom import pulldom

/Library/Frameworks/EPD64.framework/Versions/7.1/lib/python2.7/xml/dom/expatbuilder.pyc in parseString(string, namespaces)
    938     else:
    939         builder = ExpatBuilder()
--> 940     return builder.parseString(string)
    941 
    942 

/Library/Frameworks/EPD64.framework/Versions/7.1/lib/python2.7/xml/dom/expatbuilder.pyc in parseString(self, string)
    221         parser = self.getParser()
    222         try:
--> 223             parser.Parse(string, True)
    224             self._setup_subset(string)
    225         except ParseEscape:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1141336: ordinal not in range(128)

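I found previous answers to a similar question, "Reading utf-8 characters from a gzip file in python", but I still get the error.

Is there a problem with the XML parser?

(I am using Python 2.7.?)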

ekh*_*oro 5

You cannot pass a unicode string to xml.dom.minidom.parseString.

It must be an appropriately encoded byte string:

>>> import xml.dom.minidom as xmldom
>>>
>>> source = u"""\
... <?xml version="1.0" encoding="utf-8"?>
... <root><text>?? ??????? ??? ??? ????</text></root>
... """
>>> doc = xmldom.parseString(source.encode('utf-8'))
>>> print doc.getElementsByTagName('text')[0].toxml()
<text>?? ??????? ??? ??? ????</text>
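If the input may arrive either as text or as bytes, one option is a small wrapper that encodes text before parsing. This is only a sketch: `parse_xml` and its `encoding` parameter are hypothetical names, not part of minidom, and it assumes the text either carries no XML declaration or declares the same encoding that is passed in.

```python
import xml.dom.minidom


def parse_xml(source, encoding='utf-8'):
    """Parse XML given as bytes or as text (hypothetical helper).

    Bytes are passed through untouched. Text is encoded first, on the
    assumption that `encoding` matches any XML declaration it contains.
    """
    if isinstance(source, bytes):
        return xml.dom.minidom.parseString(source)
    return xml.dom.minidom.parseString(source.encode(encoding))


# Text input with a non-ASCII character and no XML declaration.
doc = parse_xml(u'<root><text>caf\xe9</text></root>')
text = doc.getElementsByTagName('text')[0].firstChild.data
```

If the text declares a different encoding than the one used to encode it, the parser will misread the bytes, so passing the original bytes through is generally the safer path.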

Edit

Just to clarify: the stream read from a gzipped XML file should be passed directly to the parser, without attempting to encode or decode it:

import gzip
import xml.dom.minidom

path = "index.mjml.gz"
gzFile = gzip.open(path, mode='r')
gzContents = gzFile.read()
gzFile.close()

xmlContent = xml.dom.minidom.parseString(gzContents)

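The parser will read the encoding from the XML declaration at the start of the file (or assume "utf-8" if there is none). It can then use that encoding to decode the contents to unicode.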
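To see the declaration being honoured, here is a minimal sketch that gzips a small document in memory and parses the decompressed bytes as-is. The sample document, the in-memory buffer standing in for a file on disk, and the choice of ISO-8859-1 are all assumptions made for illustration.

```python
import gzip
import io
import xml.dom.minidom

# Hypothetical sample document, declared and encoded as ISO-8859-1.
xml_text = (u'<?xml version="1.0" encoding="iso-8859-1"?>'
            u'<root><text>caf\xe9</text></root>')
raw_bytes = xml_text.encode('iso-8859-1')

# Compress the bytes in memory; this stands in for a .gz file on disk.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as gz_out:
    gz_out.write(raw_bytes)

# Read the decompressed bytes back without encoding or decoding them.
buf.seek(0)
with gzip.GzipFile(fileobj=buf, mode='rb') as gz_in:
    gz_contents = gz_in.read()

# The parser takes the encoding from the declaration and yields unicode text.
doc = xml.dom.minidom.parseString(gz_contents)
text = doc.getElementsByTagName('text')[0].firstChild.data
```

The bytes are not valid UTF-8 here, so calling decode('utf-8') on them before parsing would fail, whereas handing them straight to the parser works.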