Ale*_*ley 1 python beautifulsoup html-parsing web-scraping
I'm trying to scrape the people who have birthdays from this Wikipedia page.
Here is the existing code:
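import urllib2
from bs4 import BeautifulSoup

# Send a browser-like User-Agent header with the request
hdr = {'User-Agent': 'Mozilla/5.0'}
site = "http://en.wikipedia.org/wiki/"+"january"+"_"+"1"
req = urllib2.Request(site,headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
print soup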
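This all works fine and I get the whole HTML page, but I want specific data, and I don't know how to access it with Beautiful Soup when there is no id to use. The <ul> tag doesn't have an id, and neither do the <li> tags. Also, I can't just ask for every <li> tag, because there are other lists on the page. Is there a specific way to select a given list? (I can't use a fix that only works for this one page, because I plan to iterate over all dates and get the birthdays from each page, and I can't guarantee that every page is laid out exactly like this one.)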
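The idea is to get the span with the id Births, find its parent's next sibling (which is the ul), and iterate over its li elements. Here is a complete example using requests (although that part is not relevant):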
from bs4 import BeautifulSoup as Soup, Tag
import requests
response = requests.get("http://en.wikipedia.org/wiki/January_1")
soup = Soup(response.content)
births_span = soup.find("span", {"id": "Births"})
births_ul = births_span.parent.find_next_sibling()
for item in births_ul.findAll('li'):
    if isinstance(item, Tag):
        print item.text
Prints:
871 – Zwentibold, Frankish son of Arnulf of Carinthia (d. 900)
1431 – Pope Alexander VI (d. 1503)
1449 – Lorenzo de' Medici, Italian politician (d. 1492)
1467 – Sigismund I the Old, Polish king (d. 1548)
1484 – Huldrych Zwingli, Swiss pastor and theologian (d. 1531)
1511 – Henry, Duke of Cornwall (d. 1511)
1516 – Margaret Leijonhufvud, Swedish wife of Gustav I of Sweden (d. 1551)
...
Hope that helps.
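One caveat about the sibling lookup: `find_next_sibling()` with no arguments returns whichever element directly follows the heading, and that may not be the list on every date page. The following is a minimal offline sketch, assuming Python 3 and a made-up HTML fragment (the `SAMPLE_HTML` string and the `births_after_heading` helper are hypothetical, not taken from Wikipedia). It shows how passing a tag name skips intervening elements and how a missing section could be handled:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that imitates the heading/list layout discussed above.
SAMPLE_HTML = """
<h2><span id="Events">Events</span></h2>
<ul><li>1801 - Some event</li></ul>
<h2><span id="Births">Births</span></h2>
<p>An intervening paragraph.</p>
<ul>
  <li>1431 - Pope Alexander VI (d. 1503)</li>
  <li>1449 - Lorenzo de' Medici, Italian politician (d. 1492)</li>
</ul>
"""


def births_after_heading(html):
    """Return the text of each <li> in the first <ul> sibling after the Births heading."""
    soup = BeautifulSoup(html, "html.parser")
    span = soup.find("span", id="Births")
    if span is None:
        return []  # this page has no Births heading
    # Naming the tag skips siblings that are not <ul>, such as the <p> above.
    births_ul = span.parent.find_next_sibling("ul")
    if births_ul is None:
        return []
    return [li.get_text(strip=True) for li in births_ul.find_all("li")]


births = births_after_heading(SAMPLE_HTML)
for line in births:
    print(line)
```

Returning an empty list instead of raising is only one possible choice; for a crawl over many pages it may be more useful to log which pages lacked the section.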
Find the Births section:
section = soup.find('span', id='Births').parent
Then find the next unordered list:
births = section.find_next('ul').find_all('li')
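Since the goal is to repeat this for every date, the two steps could be wrapped in a function and driven by a list of page names. The sketch below assumes Python 3. `date_page_names` and `extract_births` are hypothetical helper names, the `Month_Day` title pattern is inferred from the `January_1` URL in the question, and the network fetch is left out so the block runs offline:

```python
import calendar

from bs4 import BeautifulSoup


def date_page_names(year=2024):
    """Yield page names such as 'January_1' for every day of the given year.

    A leap year is used by default so that February_29 is included.
    Month names come from calendar.month_name, which follows the current
    locale (English under the default C locale).
    """
    for month in range(1, 13):
        days_in_month = calendar.monthrange(year, month)[1]
        for day in range(1, days_in_month + 1):
            yield "%s_%d" % (calendar.month_name[month], day)


def extract_births(html):
    """Return the <li> texts of the first <ul> that follows the Births heading."""
    soup = BeautifulSoup(html, "html.parser")
    span = soup.find("span", id="Births")
    if span is None:
        return []  # no Births heading on this page
    births_ul = span.parent.find_next("ul")
    if births_ul is None:
        return []
    return [li.get_text() for li in births_ul.find_all("li")]


names = list(date_page_names())
print(len(names), names[0], names[-1])

# Hypothetical stand-in for a fetched page.
sample = '<h2><span id="Births">Births</span></h2><ul><li>871 - Zwentibold</li></ul>'
print(extract_births(sample))
```

In a real run, each name would be appended to the base URL from the question, the page fetched, and its HTML passed to `extract_births`; whether every date page actually carries a `Births` id is something to verify rather than assume.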