How do I correctly parse UTF-8 XML with ElementTree?

min*_*als 14 python xml elementtree xml-parsing python-2.7

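I need help understanding why parsing my XML file* with xml.etree.ElementTree produces the errors below.

*My test XML file contains Arabic characters.

Task: open and parse the file utf8_file.xml.

My first attempt: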

import codecs
import xml.etree.ElementTree as etree
# codecs.open() decodes the file, so the parser receives unicode text
with codecs.open('utf8_file.xml', 'r', encoding='utf-8') as utf8_file:
    xml_tree = etree.parse(utf8_file)

Result 1:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 236-238: ordinal not in range(128)

My second attempt:

import codecs
import xml.etree.ElementTree as etree
with codecs.open('utf8_file.xml', 'r', encoding='utf-8') as utf8_file:
    # tostring() is handed the open file object here, not a parsed element
    xml_string = etree.tostring(utf8_file, encoding='utf-8', method='xml')
    xml_tree = etree.fromstring(xml_string)

Result 2:

AttributeError: 'file' object has no attribute 'getiterator'

Please explain the errors above and comment on possible solutions.

Mar*_*ers 15

Leave decoding the bytes to the parser; do not decode the file yourself first:

import xml.etree.ElementTree as etree
# Binary mode: the parser receives the raw bytes and decodes them itself
with open('utf8_file.xml', 'rb') as xml_file:
    xml_tree = etree.parse(xml_file)

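The XML file must carry enough information in its first line for the parser to handle the decoding. If that header is missing, the parser must assume UTF-8.

Because the XML header is what holds this information, the parser is responsible for doing all of the decoding.

Your first attempt failed because Python tried to encode the Unicode values again so that the parser could work with a byte string, which is what it expects. That implicit encoding step uses the ASCII codec, hence the UnicodeEncodeError on the Arabic characters. Your second attempt failed because etree.tostring() expects a parsed tree (an element) as its first argument, not an open file object or a unicode string.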
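
As a minimal sketch of the explanation above, the following writes a small UTF-8 document to a temporary file, opens it in binary mode so the parser does the decoding, and then calls etree.tostring() on the parsed root element rather than on a file. The sample document, the temporary file, and the helper name parse_xml_bytes are illustrative assumptions, not part of the original question.

```python
import os
import tempfile
import xml.etree.ElementTree as etree

# Hypothetical sample document containing Arabic text (written with escapes
# so the source file itself stays ASCII).
SAMPLE = (u'<?xml version="1.0" encoding="utf-8"?>\n'
          u'<greeting>\u0645\u0631\u062d\u0628\u0627</greeting>')


def parse_xml_bytes(path):
    """Open the file in binary mode and let the parser decode the bytes."""
    with open(path, 'rb') as xml_file:
        return etree.parse(xml_file)


# Write the sample as UTF-8 bytes to a temporary file, then parse it back.
fd, path = tempfile.mkstemp(suffix='.xml')
try:
    with os.fdopen(fd, 'wb') as out:
        out.write(SAMPLE.encode('utf-8'))
    root = parse_xml_bytes(path).getroot()
finally:
    os.remove(path)

# tostring() takes a parsed element and returns UTF-8 encoded bytes,
# which fromstring() can parse again directly.
xml_bytes = etree.tostring(root, encoding='utf-8', method='xml')
round_tripped = etree.fromstring(xml_bytes)
```

After parsing, root.text is already a decoded Unicode string, so no further decoding is needed on the caller's side.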

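  • `etree.parse(a_file)` handles Unicode by default. However, `etree.fromstring(a_string)` did not parse Unicode strings until Python 3.x (see http://bugs.python.org/issue11033), so before that you have to encode manually, e.g. `etree.fromstring(a_string.encode('UTF-8'))`. (4 upvotes)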
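
The workaround described in the comment above can be sketched as follows. It assumes the text either has no XML declaration or declares UTF-8; if the declaration named a different encoding, the encoded bytes would no longer match it. The helper name parse_unicode_xml and the sample string are hypothetical.

```python
import xml.etree.ElementTree as etree


def parse_unicode_xml(xml_text):
    """Encode Unicode text to UTF-8 bytes before handing it to fromstring().

    Assumption: xml_text has no XML declaration, or one that says UTF-8.
    """
    return etree.fromstring(xml_text.encode('utf-8'))


# Hypothetical sample: an element whose text is an Arabic name.
xml_text = u'<name>\u0639\u0644\u064a</name>'
root = parse_unicode_xml(xml_text)
```

Encoding explicitly behaves the same way on Python 2.7 and Python 3, which avoids relying on the implicit ASCII encoding that caused the original UnicodeEncodeError.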