Pas*_* W. 2 python html-parsing
A particular page retrieved from a URL has the following markup:
<p>
<strong>Name:</strong> Pasan <br/>
<strong>Surname: </strong> Wijesingher <br/>
<strong>Former/AKA Name:</strong> No Former/AKA Name <br/>
<strong>Gender:</strong> Male <br/>
<strong>Language Fluency:</strong> ENGLISH <br/>
</p>
I want to extract the data in "Name", "Surname" and so on (I have to repeat this task on many pages).
To do that, I tried the following code:
import urllib2
url = 'http://www.my.lk/details.aspx?view=1&id=%2031'
source = urllib2.urlopen(url)
start = '<p><strong>Given Name:</strong>'
end = '<strong>Surname'
givenName=(source.read().split(start))[1].split(end)[0]
start = 'Surname: </strong>'
end = 'Former/AKA Name'
surname=(source.read().split(start))[1].split(end)[0]
print(givenName)
print(surname)
When I call source.read().split only once, it works fine. But when I use it twice, it raises a "list index out of range" error.
Can anyone suggest a solution?
You can use BeautifulSoup to parse the HTML string.
Here is some code you can try. It uses BeautifulSoup to get the text that the HTML produces, then parses that string to extract the data.
from bs4 import BeautifulSoup as bs

dic = {}
data = \
"""
<p>
<strong>Name:</strong> Pasan <br/>
<strong>Surname: </strong> Wijesingher <br/>
<strong>Former/AKA Name:</strong> No Former/AKA Name <br/>
<strong>Gender:</strong> Male <br/>
<strong>Language Fluency:</strong> ENGLISH <br/>
</p>
"""

# Name the parser explicitly so bs4 does not have to guess one
soup = bs(data, 'html.parser')
# Get the text of the HTML through BeautifulSoup
text = soup.get_text()
# Parse the text line by line
lines = text.splitlines()
for line in lines:
    # skip lines that have no ':'
    if ':' not in line:
        continue
    # split at the first ':' only, then strip whitespace from both parts
    parts = [part.strip() for part in line.split(':', 1)]
    # add the values to a dictionary
    dic[parts[0]] = parts[1]
    # print the data after processing
    print('%16s %20s' % (parts[0], parts[1]))
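If installing BeautifulSoup is not an option, the same label/value idea can be sketched with the standard library alone. This is a minimal sketch, assuming Python 3 and assuming every label sits inside a `<strong>` tag and is followed directly by its value, as in the sample above. The names `LabelValueParser`, `extract_fields` and `SAMPLE` are made up for illustration.

```python
from html.parser import HTMLParser  # Python 3 standard library


class LabelValueParser(HTMLParser):
    """Collect label/value pairs where each label sits in a <strong> tag."""

    def __init__(self):
        super().__init__()
        self.fields = {}         # label -> value
        self._in_strong = False  # True while inside <strong>...</strong>
        self._label = None       # label still waiting for its value

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self._in_strong = True

    def handle_endtag(self, tag):
        if tag == "strong":
            self._in_strong = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # whitespace between tags
        if self._in_strong:
            # "Surname: " becomes "Surname"
            self._label = text.rstrip(":").strip()
        elif self._label is not None:
            # first non-empty text after a label is taken as its value
            self.fields[self._label] = text
            self._label = None


def extract_fields(html):
    """Hypothetical helper: parse one page and return its fields as a dict."""
    parser = LabelValueParser()
    parser.feed(html)
    parser.close()
    return parser.fields


SAMPLE = """
<p>
<strong>Name:</strong> Pasan <br/>
<strong>Surname: </strong> Wijesingher <br/>
<strong>Former/AKA Name:</strong> No Former/AKA Name <br/>
<strong>Gender:</strong> Male <br/>
<strong>Language Fluency:</strong> ENGLISH <br/>
</p>
"""

fields = extract_fields(SAMPLE)
for label, value in fields.items():
    print('%16s %20s' % (label, value))
```

Because the parser follows the tags rather than splitting on ':', a value that itself contains a colon would be kept whole. Calling `extract_fields` once per downloaded page would cover the "many pages" case.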
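Tip: if you parse HTML with BeautifulSoup, it helps a lot when the tags carry attributes such as class=input or id=10, i.e. when all tags of the same kind share the same id or class.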
Update
In response to your comment, see the code below.
It applies the tip above, which makes the work (and the coding) easier.
from bs4 import BeautifulSoup as bs

c_addr = []
id_addr = []
data = \
"""
<h2>Primary Location</h2>
<div class="address" id="10">
<p>
No. 4<br>
Private Drive,<br>
Sri Lanka ON K7L LK <br>
"""

soup = bs(data, 'html.parser')
for i in soup.find_all('div'):
    # get data using the "class" attribute (bs4 returns a list, or None if absent)
    if 'address' in (i.get('class') or []):
        addr = ""
        for line in i.get_text().splitlines():  # line-wise
            addr += line.strip()                # remove whitespace, add to address string
        c_addr.append(addr)
    # get data using the "id" attribute (a string, or None if absent)
    if i.get('id') == '10':
        addr = ""
        # same processing as above
        for line in i.get_text().splitlines():
            addr += line.strip()
        id_addr.append(addr)

print("id_addr")
print(id_addr)
print("c_addr")
print(c_addr)
小智 5
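You are calling read() twice. That is the problem. Instead, call read() once, store the data in a variable, and use that variable wherever you were calling read(). Something like this: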
fetched_data = source.read()
Then later...
givenName=(fetched_data.split(start))[1].split(end)[0]
and...
surname=(fetched_data.split(start))[1].split(end)[0]
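That should work. Your code fails because the first read() consumes the whole response and leaves the stream positioned at the end of the content. The next read() has nothing left and returns an empty string. Splitting that empty string yields a list with a single element, so indexing it with [1] raises the "list index out of range" error.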
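The behaviour described above can be reproduced without a network connection. The following is a minimal sketch, assuming Python 3, in which `io.StringIO` stands in for the object returned by `urlopen`. The helper `between` and the shortened sample markup are made up for illustration.

```python
import io


def between(text, start, end):
    """Return the stripped text between the first `start` and the next `end`."""
    after_start = text.split(start, 1)[1]
    return after_start.split(end, 1)[0].strip()


# Stand-in for the urlopen() result: any file-like object behaves the same way.
source = io.StringIO(
    "<p><strong>Name:</strong> Pasan <br/>"
    "<strong>Surname: </strong> Wijesingher <br/>"
    "<strong>Former/AKA Name:</strong> No Former/AKA Name <br/></p>"
)

fetched_data = source.read()  # read the body exactly once and keep it
leftover = source.read()      # a second read() finds the stream exhausted

given_name = between(fetched_data, "Name:</strong>", "<br/>")
surname = between(fetched_data, "Surname: </strong>", "<br/>")

print(repr(leftover))
print(given_name)
print(surname)
```

Here `leftover` is the empty string, which is why the second split in the question had no element at index 1. Ending each field at `<br/>` rather than at the next label keeps the stray tag out of the extracted value.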