Python: Unicode and ElementTree.parse

Question

San*_*nta 10 python xml unicode python-3.x

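I'm trying to move to Python 2.7, and since Unicode is a big deal there, I want to handle it properly when working with XML files and text, parsing them with the xml.etree.cElementTree library. But I ran into this error: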

>>> import xml.etree.cElementTree as ET
>>> from io import StringIO
>>> source = """\
... <?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
... <root>
...   <Parent>
...     <Child>
...       <Element>Text</Element>
...     </Child>
...   </Parent>
... </root>
... """
>>> srcbuf = StringIO(source.decode('utf-8'))
>>> doc = ET.parse(srcbuf)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 56, in parse
  File "<string>", line 35, in parse
cElementTree.ParseError: no element found: line 1, column 0

The same thing happens when I pass the result of io.open('filename.xml', encoding='utf-8') to ET.parse:

>>> import io
>>> with io.open('test.xml', mode='w', encoding='utf-8') as fp:
...     fp.write(source.decode('utf-8'))
...
150L
>>> with io.open('test.xml', mode='r', encoding='utf-8') as fp:
...     fp.read()
...
u'<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>\n<root>\n  <Parent>\n
    <Child>\n      <Element>Text</Element>\n    </Child>\n  </Parent>\n</root>\n
'
>>> with io.open('test.xml', mode='r', encoding='utf-8') as fp:
...     ET.parse(fp)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "<string>", line 56, in parse
  File "<string>", line 35, in parse
cElementTree.ParseError: no element found: line 1, column 0

Is there something about unicode and ET parsing that I'm missing here?

Edit: Apparently the ET parser does not play well with unicode input streams? The following works:

>>> with io.open('test.xml', mode='rb') as fp:
...     ET.parse(fp)
...
<ElementTree object at 0x0180BC10>

But doesn't this also mean I can't use io.StringIO when I want to parse text held in memory, unless I first encode it into an in-memory byte buffer?

Answer 1

Gly*_*yph 15

Your problem is that you are feeding ElementTree unicode, but it prefers to consume bytes. Either way, it will hand unicode back to you.

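In Python 2.x, it can only consume bytes. You can tell it what encoding those bytes are in, but that's it. So if you really are working with an object that represents a text file, such as io.StringIO, you first need to convert it into something else.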

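If you are actually starting with a 2.x str (AKA bytes) in UTF-8 encoding, in memory, as in your example, use xml.etree.cElementTree.XML to parse it into XML in one go and don't worry about any of this :-).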
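
A minimal sketch of that one-shot approach, assuming the document is already in memory as UTF-8 bytes. It imports xml.etree.ElementTree rather than cElementTree so that it also runs on current Python 3 (where cElementTree no longer exists); the sample document is the one from the question.

```python
import xml.etree.ElementTree as ET

# Bytes, not text: the b prefix keeps this a byte string on Python 2.7 and 3.
source = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<root>
  <Parent>
    <Child>
      <Element>Text</Element>
    </Child>
  </Parent>
</root>
"""

# XML() parses a complete in-memory document and returns the root Element.
root = ET.XML(source)
element_text = root.find("Parent/Child/Element").text
print(element_text)
```

ET.fromstring is an equivalent spelling of ET.XML for this purpose.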

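If you want an interface that can cope with data read incrementally from a file, use xml.etree.cElementTree.parse with an io.BytesIO, which gives you an in-memory stream of bytes rather than an in-memory string of characters. If you want to use io.open, use it with the b flag so that you get a stream of bytes.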
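
Both variants can be sketched as follows. This is an illustration only: the short sample document and the scratch file created under a temporary directory are assumptions, and xml.etree.ElementTree stands in for cElementTree so the snippet runs on Python 3 as well.

```python
import io
import os
import tempfile
import xml.etree.ElementTree as ET

# Illustrative unicode text, as you might hold it in an io.StringIO.
text = u'<?xml version="1.0" encoding="UTF-8"?>\n<root><Element>Text</Element></root>\n'

# In memory: encode the text first, then wrap the resulting bytes in BytesIO.
doc = ET.parse(io.BytesIO(text.encode("utf-8")))
memory_text = doc.getroot().find("Element").text

# On disk: open in binary mode ("rb") so the parser receives bytes.
path = os.path.join(tempfile.mkdtemp(), "test.xml")  # hypothetical scratch file
with io.open(path, mode="wb") as fp:
    fp.write(text.encode("utf-8"))
with io.open(path, mode="rb") as fp:
    file_text = ET.parse(fp).getroot().find("Element").text

print(memory_text)
print(file_text)
```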

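In Python 3.x, you can pass unicode directly to ElementTree, which is a bit more convenient, and arguably the newer version of ElementTree is more correct. However, you still might not want to, and the Python 3 version still accepts bytes as input. You are always starting from bytes anyway: by passing them straight from your input source to ElementTree, you let it do its encoding and decoding intelligently inside the XML parsing engine, including on-the-fly detection of encoding declarations in the input stream, which is possible with XML but not with arbitrary textual data. So the XML parser is the right place to put the responsibility for decoding.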
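
To make the point about encoding declarations concrete, here is a small Python 3 sketch. The Latin-1 sample document is made up for illustration; the only thing it relies on is that the bytes really are in the encoding the declaration names.

```python
import io
import xml.etree.ElementTree as ET

# Illustrative document whose declaration names a non-UTF-8 encoding.
text = '<?xml version="1.0" encoding="ISO-8859-1"?>\n<root>caf\u00e9</root>'
raw = text.encode("iso-8859-1")  # the bytes as they would sit in a file

# Given bytes, the parser reads the declaration and decodes them itself.
from_bytes = ET.parse(io.BytesIO(raw)).getroot().text

# Python 3 also accepts an already-decoded text stream.
from_text = ET.parse(io.StringIO("<root>caf\u00e9</root>")).getroot().text

print(from_bytes == from_text)
```

With the bytes route, nothing in the calling code needs to know the file's encoding ahead of time.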

I wish you had added some example code (11 upvotes)

Answer 2

小智 7

I ran into the same problem as you in Python 2.6.

It seems that cElementTree.parse handles "utf-8" input differently in Python 2.x and 3.x. In Python 2.x, we can use an XMLParser to specify the encoding. For example:

import xml.etree.cElementTree as etree

parser = etree.XMLParser(encoding="utf-8")
targetTree = etree.parse( "./targetPageID.xml", parser=parser )
pageIds = targetTree.find("categorymembers")
print "pageIds:",etree.tostring(pageIds)

You can refer to this page for the XMLParser method (see the "XMLParser" section): http://effbot.org/zone/elementtree-13-intro.htm
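
As a rough, self-contained illustration of what the encoding argument does, the sketch below parses bytes that carry no XML declaration. The Latin-1 sample data is an assumption made for the example, and xml.etree.ElementTree is used in place of cElementTree so it runs on Python 3 too.

```python
import io
import xml.etree.ElementTree as ET

# Latin-1 bytes with no XML declaration (illustrative data).
raw = u"<root>caf\u00e9</root>".encode("iso-8859-1")

# The encoding argument tells the parser how to decode the incoming bytes.
parser = ET.XMLParser(encoding="iso-8859-1")
tree = ET.parse(io.BytesIO(raw), parser=parser)
root_text = tree.getroot().text

print(root_text == u"caf\u00e9")
```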

The following approach, on the other hand, works in Python 3.x:

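import codecs
# cElementTree is deprecated since Python 3.3 and removed in 3.9;
# ElementTree uses the C accelerator automatically.
import xml.etree.ElementTree as etree

# Open the file as decoded text; the Python 3 parser accepts a text stream.
with codecs.open("./targetPageID.xml", mode='r', encoding='utf-8') as target_file:
    targetTree = etree.parse(target_file)

pageIds = targetTree.find("categorymembers")
# print is a function in Python 3.
print("pageIds:", etree.tostring(pageIds))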

Hope this helps.

Answer 3

And*_*ner 5

Can't you use

doc = ET.fromstring(source)

in your first example?
