Man*_*uel 3 urllib2 elementtree python-2.7
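I'm trying to parse an HTML page in Python using urllib2 and ElementTree, and I'm running into trouble. The page contains "&" inside a quoted string, and ElementTree throws a ParseError on any line containing &.
Script: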
import urllib2
url = 'http://eciresults.nic.in/ConstituencywiseU011.htm'
req = urllib2.Request(url, headers={'Content-type': 'text/xml'})
r = urllib2.urlopen(req).read()
import xml.etree.ElementTree as ET
htmlpage=ET.fromstring(r)
This raises the following error in Python 2.7:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1282, in XML
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1624, in feed
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1488, in _raiseerror
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 676, column 73
The error corresponds to the following line:
<input type="hidden" id="HdnFldAndamanNicobar" value="1,Andaman & Nicobar Islands;" />
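It looks like when the HTML page is read, the & character does not end up escaped as &amp; in the variable r.
I tried parsing the same page in R with htmlTreeParse, and there the "&" was correctly converted to &amp;.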
Please let me know if I'm missing anything in urllib2.
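Edit: I replaced "&" with "&amp;", but line 904 contains a < character inside JavaScript, which raises the same error. There should be a better option than replacing characters.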
LINE:904 for (i = 0; i < strac.length - 1; i++) {
First of all, xml.etree.ElementTree is an XML parser. It cannot handle HTML entities out of the box. An unescaped & is illegal inside XML, which is why parsing fails.
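To see the failure in isolation, here is a minimal sketch. The two one-line sample strings are made up for illustration (modeled on the tag quoted in the question), and `try_parse` is a hypothetical helper, not part of ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical samples modeled on the <input> tag from the question.
raw = '<input type="hidden" value="1,Andaman & Nicobar Islands;" />'
escaped = '<input type="hidden" value="1,Andaman &amp; Nicobar Islands;" />'


def try_parse(text):
    """Return the 'value' attribute, or None if text is not well-formed XML."""
    try:
        return ET.fromstring(text).get('value')
    except ET.ParseError:
        return None


# The bare & makes the XML parser raise ParseError, so we get None back.
raw_result = try_parse(raw)

# With &amp; the document is well-formed, and the parser decodes it back to &.
escaped_result = try_parse(escaped)
```

Escaping by hand only goes so far on a real HTML page, though: as the question's edit notes, an unescaped < inside inline JavaScript trips the XML parser in the same way.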
Start using a real, specialized HTML parser such as BeautifulSoup:
>>> from urllib2 import urlopen
>>> from bs4 import BeautifulSoup
>>> url = 'http://eciresults.nic.in/ConstituencywiseU011.htm'
>>> soup = BeautifulSoup(urlopen(url))
>>> soup.find('td').text.strip()
u'ELECTION COMMISSION OF INDIA'
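If installing BeautifulSoup is not an option, the standard library's HTMLParser is a different, lower-level tool that is also tolerant of bare & and of < inside scripts. The sketch below is an assumption-laden illustration, not the approach from the answer above: `HiddenInputCollector` and the inline `sample` page are hypothetical, built from the two lines quoted in the question.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2.7


class HiddenInputCollector(HTMLParser):
    """Hypothetical helper: collect id -> value for hidden <input> tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.values = {}

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        attrs = dict(attrs)
        if tag == 'input' and attrs.get('type') == 'hidden':
            self.values[attrs.get('id')] = attrs.get('value')


# Made-up sample containing both constructs that broke the XML parser.
sample = '''<html><body>
<input type="hidden" id="HdnFldAndamanNicobar" value="1,Andaman & Nicobar Islands;" />
<script>for (i = 0; i < strac.length - 1; i++) { }</script>
</body></html>'''

collector = HiddenInputCollector()
collector.feed(sample)
collector.close()
```

After `feed`, `collector.values` maps each hidden input's id to its value. Unlike BeautifulSoup, this gives you callbacks rather than a searchable tree, so it suits small extractions better than general navigation.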
See also: