Mag*_*gie 7 html beautifulsoup web-scraping
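I'm trying to scrape some HTML from something like this. Sometimes the data I need is in div[0], sometimes in div[1], and so on.
Imagine everyone takes 3-5 classes. One of them is always Biology. Their report cards are always in alphabetical order. I want everyone's Biology grade.
I've already got all of this HTML as text; now how do I pick out the Biology grades?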
<div class = "student">
<div class = "score">Algebra C-</div>
<div class = "score">Biology A+</div>
<div class = "score">Chemistry B</div>
</div>
<div class = "student">
<div class = "score">Biology B</div>
<div class = "score">Chemistry A</div>
</div>
<div class = "student">
<div class = "score">Alchemy D</div>
<div class = "score">Algebra A</div>
<div class = "score">Biology B</div>
</div>
<div class = "student">
<div class = "score">Algebra A</div>
<div class = "score">Biology B</div>
<div class = "score">Chemistry C+</div>
</div>
<div class = "student">
<div class = "score">Alchemy D</div>
<div class = "score">Algebra A</div>
<div class = "score">Bangladeshi History C</div>
<div class = "score">Biology B</div>
</div>
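I'm using Beautiful Soup. I think I'm going to have to find the div whose text contains "Biology"?
This is just for a quick scrape, and I'm open to hard-coding and fiddling in Excel or whatnot. Yes, it's a shoddy website! Yes, they do have an API, and I don't know a thing about WSDL.
Short version: http://www.legis.ga.gov/Legislation/en-US/Search.aspx, finding the date of the last action on each bill, FWIW. It's troublesome because if a bill has no sponsors in the second chamber, then instead of a div containing nothing, they simply have no div at all. So sometimes the timeline is div 3, sometimes 2, and so on.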
B.M*_*.W. 10
(1) To get just the Biology grades, it's almost a one-liner.
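import re
import bs4

# Name the parser explicitly so results don't depend on which parsers are installed.
soup = bs4.BeautifulSoup(html, "html.parser")
# Find every text node containing "Biology" (string= replaces the older text= argument).
score_strings = soup.find_all(string=re.compile("Biology"))
# The grade is the last whitespace-separated token, e.g. "Biology A+" -> "A+".
scores = [score_string.split()[-1] for score_string in score_strings]
print(score_strings)
print(scores)

The output looks like this:
['Biology A+', 'Biology B', 'Biology B', 'Biology B', 'Biology B']
['A+', 'B', 'B', 'B', 'B']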
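The one-liner above returns a flat list, so it no longer says which student each grade belongs to. If that matters, one option is to search inside each student div separately. This is only a sketch: the sample HTML is a shortened version of the question's markup, the third student (who has no Biology row) is a hypothetical case added for illustration, and `biology_grade` is a made-up helper name.

```python
import re
from bs4 import BeautifulSoup

# Shortened sample in the shape of the question's HTML. The third student,
# who has no Biology row, is a hypothetical case added for illustration.
html = """
<div class="student">
  <div class="score">Algebra C-</div>
  <div class="score">Biology A+</div>
</div>
<div class="student">
  <div class="score">Biology B</div>
  <div class="score">Chemistry A</div>
</div>
<div class="student">
  <div class="score">Alchemy D</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Anchor at the start so a course like "Marine Biology" would not match.
biology = re.compile(r"^\s*Biology\b")

def biology_grade(student):
    """Return the Biology grade for one student div, or None if absent."""
    row = student.find("div", class_="score", string=biology)
    if row is None:
        return None
    # "Biology A+" -> "A+": the grade is the last whitespace-separated token.
    return row.get_text().split()[-1]

grades = [biology_grade(s) for s in soup.find_all("div", class_="student")]
print(grades)
```

Because each lookup is by text rather than by position, it does not matter whether the Biology row is the first, second, or third div for a given student.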
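(2) Once you've located the text, you may need the enclosing tag for other tasks; for that, use parent:
import re
import bs4

soup = bs4.BeautifulSoup(html, "html.parser")
# find_all(string=...) returns the matching text nodes, not the tags.
scores = soup.find_all(string=re.compile("Biology"))
# .parent steps from each text node up to the div that contains it.
divs = [score.parent for score in scores]
print(divs)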
The output looks like this:
[<div class="score">Biology A+</div>,
<div class="score">Biology B</div>,
<div class="score">Biology B</div>,
<div class="score">Biology B</div>,
<div class="score">Biology B</div>]
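*In short, you can move around the HTML tree with attributes such as parent and methods such as find_next_siblings, find_previous_siblings and find_parent.*
See the Beautiful Soup documentation on navigating the tree for more. Good luck with your work.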
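As a rough illustration of that kind of navigation, the sketch below starts from the "Biology" text node, steps up to its score div and the enclosing student div, and then looks sideways at the neighbouring score divs. The HTML is one student copied from the question; the variable names are only illustrative.

```python
import re
from bs4 import BeautifulSoup

# One student block taken from the question's sample HTML.
html = """
<div class="student">
  <div class="score">Algebra A</div>
  <div class="score">Biology B</div>
  <div class="score">Chemistry C+</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Start at the text node, then walk upwards.
text_node = soup.find(string=re.compile("Biology"))
score_div = text_node.parent      # <div class="score">Biology B</div>
student_div = score_div.parent    # the enclosing <div class="student">

# Walk sideways; passing "div" skips the whitespace text nodes between tags.
prev_div = score_div.find_previous_sibling("div")
next_div = score_div.find_next_sibling("div")

print(student_div["class"])
print(prev_div.get_text())
print(next_div.get_text())
```

Both sibling methods return None when there is no such neighbour, so a real scrape should check for that before calling get_text().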
Another approach (using CSS selectors) is:
divs = soup.select('div:-soup-contains("Biology")')
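Edit:
Requires BeautifulSoup4 4.7.0+ (SoupSieve). The :-soup-contains spelling needs SoupSieve 2.1 or later; earlier releases call it :contains.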
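One thing to watch with this selector: :-soup-contains looks at an element's own text and the text of its descendants, so a bare div selector also matches the outer student div. A minimal sketch of narrowing it to the score rows, assuming SoupSieve 2.1+ is installed and using a shortened copy of the question's HTML:

```python
from bs4 import BeautifulSoup

# One student block in the shape of the question's sample HTML.
html = """
<div class="student">
  <div class="score">Algebra C-</div>
  <div class="score">Biology A+</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Matches the score div AND the surrounding student div, because the
# student div contains "Biology" through its descendant.
broad = soup.select('div:-soup-contains("Biology")')

# Restricting the selector to class "score" keeps only the row itself.
narrow = soup.select('div.score:-soup-contains("Biology")')

print(len(broad), len(narrow))
print([div.get_text() for div in narrow])
```

For a quick scrape either form works, as long as the extra outer matches are filtered out afterwards.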