nun*_*nos 3 python lxml html-parsing
I'm writing a simple script to fetch the large gray table from here.
My code is as follows:
import urllib2
from lxml import etree
html = urllib2.urlopen("http://www.afi.com/100years/movies10.aspx").read()
root = etree.XML(html)
But I get an error on the last statement.
Traceback (most recent call last):
File "D:\Workspace\afi100\afi100.py", line 13, in <module>
root = etree.XML(html)
File "lxml.etree.pyx", line 2720, in lxml.etree.XML (src/lxml/lxml.etree.c:52577)
File "parser.pxi", line 1556, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:79602)
File "parser.pxi", line 1435, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:78449)
File "parser.pxi", line 943, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:75099)
File "parser.pxi", line 547, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71467)
File "parser.pxi", line 628, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72340)
File "parser.pxi", line 568, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71683)
XMLSyntaxError: Space required after the Public Identifier, line 3, column 59
Any ideas on how to fix this error?
Thanks.
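You are trying to parse HTML with an XML parser. etree.XML() expects well-formed XML, and the page's DOCTYPE declaration is not valid XML, hence the "Space required after the Public Identifier" error. Use lxml's HTML parser instead.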
import urllib2
from lxml import etree

# Real-world HTML is rarely well-formed XML, so pass the lenient HTMLParser.
ufile = urllib2.urlopen("http://www.afi.com/100years/movies10.aspx")
root = etree.parse(ufile, etree.HTMLParser())
print etree.tostring(root)
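Once the page parses, getting at the table is a matter of walking the tree. Below is a minimal sketch of that step, with no network access. It assumes an inline sample document in place of the real page: SAMPLE_HTML, its placeholder film titles, and the helper names parse_rows and xml_parse_fails are all made up for illustration, and the real page's table layout may well differ.

```python
from lxml import etree

# Hypothetical stand-in for the downloaded page. The DOCTYPE has a public
# identifier but no system identifier, which is legal HTML but not legal XML.
SAMPLE_HTML = b"""<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html><body>
<p>Top films<br>
<table>
<tr><td>1</td><td>Sample Film A</td></tr>
<tr><td>2</td><td>Sample Film B</td></tr>
</table>
</body></html>"""


def xml_parse_fails(html_bytes):
    """Return True if the strict XML parser rejects the document."""
    try:
        etree.XML(html_bytes)
    except etree.XMLSyntaxError:
        return True
    return False


def parse_rows(html_bytes):
    """Parse with the lenient HTML parser and return each row's cell texts."""
    root = etree.fromstring(html_bytes, etree.HTMLParser())
    rows = []
    for tr in root.iter("tr"):
        # td.text can be None for empty cells, so fall back to "".
        rows.append([(td.text or "").strip() for td in tr.iter("td")])
    return rows
```

Calling xml_parse_fails(SAMPLE_HTML) shows the strict parser rejecting the input, while parse_rows(SAMPLE_HTML) returns one list of cell strings per table row. For the real page you would probably need to narrow the search to the specific table, for example with an XPath expression matched against its actual markup.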