How do I select div text content with Beautiful Soup?

Mag*_*gie 7 html beautifulsoup web-scraping

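I'm trying to scrape some HTML from something like this. Sometimes the data I need is in div[0], sometimes in div[1], and so on.

Imagine everyone takes 3-5 classes. One of them is always Biology. Their report card is always alphabetized. I want everybody's Biology grade.

I've already got all this HTML as text. Now how do I fish out the Biology grades?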

<div class = "student">
    <div class = "score">Algebra C-</div>
    <div class = "score">Biology A+</div>
    <div class = "score">Chemistry B</div>
</div>
<div class = "student">
    <div class = "score">Biology B</div>
    <div class = "score">Chemistry A</div>
</div>
<div class = "student">
    <div class = "score">Alchemy D</div>
    <div class = "score">Algebra A</div>
    <div class = "score">Biology B</div>
</div>
<div class = "student">
    <div class = "score">Algebra A</div>
    <div class = "score">Biology B</div>
    <div class = "score">Chemistry C+</div>
</div>
<div class = "student">
    <div class = "score">Alchemy D</div>
    <div class = "score">Algebra A</div>
    <div class = "score">Bangladeshi History C</div>
    <div class = "score">Biology B</div>
</div>

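I'm using Beautiful Soup. I think I'll have to find the div whose text contains "Biology"?

This is only for a quick scrape, and I'm open to hard-coding and fiddling in Excel or whatnot. Yes, it's a shoddy website! Yes, they do have an API, and I don't know a thing about WSDL.

Short version: I'm using http://www.legis.ga.gov/Legislation/en-US/Search.aspx to find the date of the last action on every bill, FWIW. It's troublesome because if a bill has no sponsors in the second chamber, then instead of an empty div they have no div there at all. So sometimes the timeline is div 3, sometimes div 2, and so on.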

B.M*_*.W. 10

(1) To get just the Biology grades, it is almost a one-liner.

import re
import bs4

# `html` is the page source from the question, already loaded as a string.
soup = bs4.BeautifulSoup(html, "html.parser")
# Match every text node containing "Biology" (`string=` replaces the older `text=`).
scores_string = soup.find_all(string=re.compile('Biology'))
# The grade is the last whitespace-separated token, e.g. "Biology A+" -> "A+".
scores = [score_string.split()[-1] for score_string in scores_string]
print(scores_string)
print(scores)

The output looks like this:

['Biology A+', 'Biology B', 'Biology B', 'Biology B', 'Biology B']
['A+', 'B', 'B', 'B', 'B']
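If the grades need to stay lined up with the students, a variation is to walk each student div and look inside it. This is only a sketch on a shortened copy of the question's markup. The third student without Biology is made up here to show the missing case, and `biology_grade` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's sample; the last student (no Biology)
# is an invented case, not part of the original data.
html = """
<div class="student">
    <div class="score">Algebra C-</div>
    <div class="score">Biology A+</div>
    <div class="score">Chemistry B</div>
</div>
<div class="student">
    <div class="score">Biology B</div>
    <div class="score">Chemistry A</div>
</div>
<div class="student">
    <div class="score">Alchemy D</div>
    <div class="score">Algebra A</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def biology_grade(student):
    """Return the Biology grade inside one student div, or None if absent."""
    for score in student.find_all("div", class_="score"):
        text = score.get_text(strip=True)
        if text.startswith("Biology"):
            # The grade is the last whitespace-separated token.
            return text.split()[-1]
    return None


# One entry per student, in document order.
grades = [biology_grade(s) for s in soup.find_all("div", class_="student")]
print(grades)
```

Because the result has one entry per student, it does not matter whether Biology is the first, second, or fourth div in the report card.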

(2) Once you have located the text, you may need the enclosing tag for further work; for that, use parent:

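import re
import bs4

# `html` is the page source from the question, already loaded as a string.
soup = bs4.BeautifulSoup(html, "html.parser")
scores = soup.find_all(string=re.compile('Biology'))
# Each match is a text node; .parent is the <div> that contains it.
divs = [score.parent for score in scores]
print(divs)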

The output looks like this:

[<div class="score">Biology A+</div>, 
<div class="score">Biology B</div>, 
<div class="score">Biology B</div>, 
<div class="score">Biology B</div>, 
<div class="score">Biology B</div>]

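*In short, you can move around the HTML tree with .parent, find_parent(), find_next_sibling(), and similar methods.*

The Beautiful Soup documentation has more information about how to navigate the tree. Good luck with your work.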
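As a rough illustration of moving around the tree, the sketch below starts from the Biology text node and steps to its enclosing div, the surrounding student div, and the neighbouring score divs. It uses a single student copied from the question; the variable names are only for illustration.

```python
import re
from bs4 import BeautifulSoup

# One student block copied from the question's sample markup.
html = """
<div class="student">
    <div class="score">Algebra C-</div>
    <div class="score">Biology A+</div>
    <div class="score">Chemistry B</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Start at the text node, then walk outwards and sideways.
text_node = soup.find(string=re.compile("Biology"))
score_div = text_node.parent                                   # the <div class="score"> holding the text
student_div = score_div.find_parent("div", class_="student")   # the enclosing student
next_score = score_div.find_next_sibling("div")                # the div after Biology
prev_score = score_div.find_previous_sibling("div")            # the div before Biology

print(score_div.get_text())
print(next_score.get_text())
print(prev_score.get_text())
print(len(student_div.find_all("div", class_="score")))
```

The sibling lookups are given a tag name so that the whitespace text nodes between the divs are skipped.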


Ana*_*nov 8

Another way (using CSS selectors) is:

divs = soup.select('div.score:-soup-contains("Biology")')

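Edit:

Requires BeautifulSoup4 4.7.0+ (which uses Soup Sieve for select()). The :-soup-contains spelling needs Soup Sieve 2.1 or later; earlier releases call it :contains.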
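Here is a small runnable sketch of the selector approach on a shortened copy of the question's markup, assuming a recent Beautiful Soup with Soup Sieve 2.1+. One thing to watch: :-soup-contains also looks at descendant text, so a bare div selector matches the outer student divs as well as the score divs. Narrowing the selector to div.score avoids that.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's sample markup.
html = """
<div class="student">
    <div class="score">Algebra C-</div>
    <div class="score">Biology A+</div>
    <div class="score">Chemistry B</div>
</div>
<div class="student">
    <div class="score">Biology B</div>
    <div class="score">Chemistry A</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Matches the score divs AND their student parents, because the parents'
# descendant text also contains "Biology".
all_divs = soup.select('div:-soup-contains("Biology")')

# Restricting to the score class keeps only the inner divs.
score_divs = soup.select('div.score:-soup-contains("Biology")')

# The grade is the last whitespace-separated token of each match.
grades = [d.get_text(strip=True).split()[-1] for d in score_divs]

print(len(all_divs), len(score_divs))
print(grades)
```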