Encoding error when parsing RSS with lxml

dom*_*omi 9 python rss lxml chardet scraperwiki

I want to parse a downloaded RSS feed with lxml, but I don't know how to handle the UnicodeDecodeError.

request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
response = urllib2.urlopen(request)
response = response.read()
encd = chardet.detect(response)['encoding']
parser = etree.XMLParser(ns_clean=True,recover=True,encoding=encd)
tree = etree.parse(response, parser)

But I get an error:

tree   = etree.parse(response, parser)
File "lxml.etree.pyx", line 2692, in lxml.etree.parse (src/lxml/lxml.etree.c:49594)
  File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71364)
  File "parser.pxi", line 1529, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:71647)
  File "parser.pxi", line 1429, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:70742)
  File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:67740)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)
  File "parser.pxi", line 559, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64027)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 97: ordinal not in range(128)

Lui*_*ger 45

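I ran into a similar problem, and it turned out to have nothing to do with encodings. What is happening is that lxml is throwing a completely unrelated error. In this case, the error is that the .parse function expects a filename or URL, not a string containing the content itself. However, when it tries to print out that error, it chokes on the non-ASCII characters and shows this completely confusing error message instead. It is highly unfortunate, and other people have commented on the issue here: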

https://mailman-mail5.webfaction.com/pipermail/lxml/2009-February/004393.html

Luckily, yours is easy to fix. Just replace .parse with .fromstring and you should be good to go:

import urllib2
import chardet
from lxml import etree

request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
response = urllib2.urlopen(request)
response = response.read()  # raw bytes of the feed
encd = chardet.detect(response)['encoding']
parser = etree.XMLParser(ns_clean=True, recover=True, encoding=encd)

## lxml Y U NO MAKE SENSE!!!
# fromstring() takes the content itself; parse() expects a filename or URL
tree = etree.fromstring(response, parser)

Just tested this on my machine and it worked fine. Hope it helps!
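For anyone who wants to see the parse/fromstring difference without hitting the network, here is a minimal offline sketch. The feed content is a made-up stand-in for what response.read() returns, and the fallback to the standard library's xml.etree.ElementTree (which has the same parse/fromstring split) is my own assumption so the snippet still runs where lxml is not installed. It is written for Python 3, so the exact exception will differ from the Python 2 traceback in the question.

```python
try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: the stdlib module exposes the same parse/fromstring split
    import xml.etree.ElementTree as etree

# Hypothetical stand-in for the bytes returned by response.read()
feed_bytes = u'<?xml version="1.0" encoding="UTF-8"?><rss><title>Wiadomości</title></rss>'.encode("utf-8")

# fromstring() takes the document content itself and returns the root element
root = etree.fromstring(feed_bytes)
title = root.find("title").text

# parse() treats a plain string as a filename (lxml: or URL), so handing it
# the content raises an error instead of parsing it
try:
    etree.parse("<rss><title>news</title></rss>")
    parse_failed = False
except (OSError, etree.ParseError):
    parse_failed = True
```

Here title ends up holding the decoded Polish text, while the parse() call fails because no file with that name exists.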

  • May your days be in harmony with eternal beauty! (8 upvotes)

Ian*_* B. 0

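You should probably only try to define the character encoding yourself as a last resort, since the encoding is already clear from the XML prolog (if not from the HTTP headers). In any case, there is no need to pass an encoding to etree.XMLParser unless you want to override the declared one, so drop the encoding parameter and it should work.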
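A small sketch of the point about the XML prolog, assuming a feed that declares its own encoding. The ISO-8859-2 document below is a made-up example (not the actual onet.pl feed), and the stdlib fallback is only there so the snippet runs without lxml. Note that the parser needs the raw bytes for this to work.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib parser also honours the XML declaration
    import xml.etree.ElementTree as etree

# Hypothetical feed fragment, encoded as ISO-8859-2 and declared as such
declared = u'<?xml version="1.0" encoding="ISO-8859-2"?><rss><title>Wiadomości</title></rss>'
raw_bytes = declared.encode("iso-8859-2")

# No encoding argument: the parser reads the encoding from the XML declaration
root = etree.fromstring(raw_bytes)
decoded_title = root.find("title").text
```

If a feed declares the wrong encoding, overriding it through the parser is still an option, which is the "last resort" case mentioned above.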

Edit: OK, the problem actually seems to be with lxml. For whatever reason, the following works:

parser = etree.XMLParser(ns_clean=True, recover=True)
etree.parse('http://wiadomosci.onet.pl/kraj/rss.xml', parser)
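As far as I know, parse() also accepts a file-like object, so another option is to hand it the response object without calling .read() first. A rough offline sketch, where io.BytesIO stands in for the object returned by urlopen() and the feed content is invented for illustration:

```python
import io

try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the stdlib, whose parse() also takes file objects
    import xml.etree.ElementTree as etree

# io.BytesIO plays the role of the un-read HTTP response (hypothetical content)
fake_response = io.BytesIO(
    u'<?xml version="1.0" encoding="UTF-8"?>'
    u'<rss><channel><title>Wiadomości</title></channel></rss>'.encode("utf-8")
)

# parse() accepts a file-like object (lxml also accepts a filename or URL)
# and returns an ElementTree rather than the root element
tree = etree.parse(fake_response)
channel_title = tree.getroot().find("channel/title").text
```

Unlike fromstring(), this returns the whole tree, so call getroot() to reach the root element.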