Parsing an HTML page containing & with Python

Man*_*uel 3 urllib2 elementtree python-2.7

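I am trying to parse an HTML page in Python using urllib2 and ElementTree, and I am having trouble parsing the HTML. The page contains a "&" inside a quoted string, and ElementTree throws a ParseError for any line containing &.

Script: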

import urllib2

url = 'http://eciresults.nic.in/ConstituencywiseU011.htm'
req = urllib2.Request(url, headers={'Content-type': 'text/xml'})
r = urllib2.urlopen(req).read()

import xml.etree.ElementTree as ET
htmlpage=ET.fromstring(r)

This raises the following error in Python 2.7:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File     "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1282, in XML
  File     "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1624, in feed
  File     "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1488, in _raiseerror
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 676, column 73

The error corresponds to the following line:

<input type="hidden" id="HdnFldAndamanNicobar" value="1,Andaman & Nicobar Islands;" />

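It looks like the ampersand is not escaped as &amp; when the HTML page is read into the variable r.

I tried parsing the same page in R with htmlTreeParse, and there the "&" is correctly converted to &amp;.

Please let me know if I am missing something in urllib2.

Edit: I replaced "&" with "&amp;", but line 904 contains a < symbol inside JavaScript, which raises the same error. There should be a better option than replacing characters.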

LINE:904    for (i = 0; i < strac.length - 1; i++) {

ale*_*cxe 5

First of all, xml.etree.ElementTree is an XML parser. It cannot handle HTML entities out of the box. A bare & is illegal inside XML, which is why parsing fails.
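To make the failure concrete, here is a minimal sketch that feeds the offending line from the question to ElementTree twice: once as-is and once with the ampersand escaped. The helper name `try_parse` is made up for illustration, and the snippet is written so that it should run on Python 3 as well as 2.7.

```python
import xml.etree.ElementTree as ET

# The problematic line from the page: a bare "&" inside an attribute value.
raw = '<input type="hidden" id="HdnFldAndamanNicobar" value="1,Andaman & Nicobar Islands;" />'
# The same line with the ampersand escaped as an XML entity.
escaped = raw.replace('&', '&amp;')


def try_parse(text):
    """Return the 'value' attribute of the parsed element, or the error message.

    Hypothetical helper, only here to show both outcomes side by side.
    """
    try:
        return ET.fromstring(text).get('value')
    except ET.ParseError as exc:
        return 'ParseError: %s' % exc


print(try_parse(raw))      # reports a ParseError (invalid token)
print(try_parse(escaped))  # 1,Andaman & Nicobar Islands;
```

Escaping by hand only fixes this one character. As the question's edit notes, the `<` inside the page's JavaScript breaks the XML parser in the same way.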

Start using a real, specialized HTML parser such as BeautifulSoup:

>>> from urllib2 import urlopen
>>> from bs4 import BeautifulSoup
>>> url = 'http://eciresults.nic.in/ConstituencywiseU011.htm'
>>> soup = BeautifulSoup(urlopen(url))
>>> soup.find('td').text.strip()
u'ELECTION COMMISSION OF INDIA'
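If installing BeautifulSoup is not an option, the standard library's HTML parser is also lenient about a bare & and about < inside script blocks. The sketch below is an alternative to the BeautifulSoup approach, not part of it. It runs on an inline sample instead of the live page, and the class name `HiddenInputCollector` and the sample markup are assumptions made for illustration.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2.7


class HiddenInputCollector(HTMLParser):
    """Collect id -> value for every <input type="hidden"> tag (hypothetical helper)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.values = {}

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        attrs = dict(attrs)
        if tag == 'input' and attrs.get('type') == 'hidden':
            self.values[attrs.get('id')] = attrs.get('value')


# Inline sample mimicking the two lines that broke ElementTree:
# a bare "&" in an attribute and a "<" inside JavaScript.
sample = '''<html><body>
<input type="hidden" id="HdnFldAndamanNicobar" value="1,Andaman & Nicobar Islands;" />
<script>for (i = 0; i < strac.length - 1; i++) { }</script>
</body></html>'''

collector = HiddenInputCollector()
collector.feed(sample)
collector.close()
print(collector.values)
# {'HdnFldAndamanNicobar': '1,Andaman & Nicobar Islands;'}
```

For anything beyond pulling out a few attributes, BeautifulSoup's tree navigation is likely the more convenient choice.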

See also: