Converting XML-illegal &char references to UTF-8 - python

alv*_*vas 5 html python xml unicode

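A list of XML and HTML character entity references is at https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references.

However, some references are not defined in that list at all, yet they are used in older HTML scripts. While processing the Senseval-2 format (with fixes) dataset from http://www.d.umn.edu/~tpederse/data.html, I ran into the following words, which break the script I use to parse the data with xml.etree.ElementTree.

What are the Unicode equivalents of these words?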

&and.
&and.A
&and.B
&and.D
&and.L's
&backquote.alim)
&backquote.ulema
&dash
&dash.
&dash."
&dashq.
&degree.
&degree.C
&ellip
&ellip.
&ellip.0
&ellip.1
&ellip.11
&ellip.2
&ellip.23
&ellip.28
&ellip.38
&ellip.4
&ellip.6
&ellip.64
&ellip.?"
&ellip.two
&times.

My script:

import xml.etree.ElementTree as et
s1 = 'train-fix.xml' # from http://www.d.umn.edu/~tpederse/Data/Sval1to2.fix.tar.gz
tree = et.parse(s1)
root = tree.getroot()

It gives this traceback:

Traceback (most recent call last):
  File "senseval.py", line 4, in <module>
    tree = et.parse(s1)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1182, in parse
    tree.parse(source, parser)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 656, in parse
    parser.feed(data)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1642, in feed
    self._raiseerror(v)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 41, column 113
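The error can probably be reproduced without the dataset. The sketch below assumes a made-up sample string (it is not taken from train-fix.xml): in XML a bare `&` has to start a reference that ends in `;`, so a token like `&and.` followed by a space is rejected, while the standard `&amp;` parses fine.

```python
import xml.etree.ElementTree as et

# Hypothetical one-line samples imitating the dataset's markup.
broken = "<context>fish &and. chips</context>"
valid = "<context>fish &amp; chips</context>"

# The standard reference is accepted and decoded to a plain ampersand.
ok = et.fromstring(valid)
print(ok.text)

# The dataset-style reference has no terminating ';', so parsing fails.
try:
    et.fromstring(broken)
    error = None
except et.ParseError as exc:
    error = exc

print(error)
```

If that is what happens at line 41, column 113, no entity table will help the parser by itself; the references have to be rewritten before parsing, or a more lenient parser has to be used.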

wil*_*lfo 3

I found this answer, which makes it possible to parse your XML with the Python lxml package:

Getting data with Python and lxml

Install the lxml package from http://lxml.de/

and use this code:

import lxml.html
root = lxml.html.parse('train-fix.xml').getroot()

Hope it works for you.
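Note that lxml.html parses the file as HTML rather than XML. If you would rather keep xml.etree.ElementTree, one alternative is to rewrite the non-standard references before parsing. The sketch below is only an illustration: the replacement table is a guess based on the names in the question (the dataset's own definitions were not checked), it assumes the trailing period plays the role of the usual `;`, and `replace_guessed_entities` is a hypothetical helper name.

```python
import re
import xml.etree.ElementTree as et
from xml.sax.saxutils import escape

# Guessed replacements (assumption): verify against the dataset documentation.
GUESSED_ENTITIES = {
    "and": "&",
    "backquote": "`",
    "dash": "\u2014",    # em dash
    "dashq": "\u2014",   # meaning unclear; treated like "dash" here
    "degree": "\u00b0",  # degree sign
    "ellip": "\u2026",   # horizontal ellipsis
    "times": "\u00d7",   # multiplication sign
}

# Match '&name' for the known names only, then an optional terminating period.
# The lookahead skips longer names and standard ';'-terminated references.
ENTITY_RE = re.compile(
    r"&(" + "|".join(GUESSED_ENTITIES) + r")(?![A-Za-z;])\.?"
)


def replace_guessed_entities(text):
    """Replace references such as '&and.' with XML-escaped characters."""
    return ENTITY_RE.sub(lambda m: escape(GUESSED_ENTITIES[m.group(1)]), text)


# Hypothetical sample line imitating the dataset's markup.
sample = "<context>20&degree.C &and. rising&ellip.</context>"
root = et.fromstring(replace_guessed_entities(sample))
print(root.text)
```

For the real file, the same function could be applied to the full text read from train-fix.xml before passing it to `et.fromstring`; the file's encoding is not stated in the question, so it would need to be checked first.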