beautifulsoup - Extracting a link inside a div

Question


Mer*_*na 2 python screen-scraping beautifulsoup

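I have a soup with content like the following.

It contains many divs; the ones I'm interested in have the class "foo".

Each of those divs holds many links and other content. I'm interested in the second link (the second <a> </a>); it is always the second one. I want to get the URL (in the href attribute) and the text between the tags of that second <a> </a>.

For example: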

<div class ="foo">
     <a href ="http://example.com"> </a>
     <a href ="http://example2.com"> Title here </a>
</div>

<div class ="foo">
     <a href ="http://example3.com"> </a>
     <a href ="http://example4.com"> Title 2 here </a>
</div>


Here, I would like to get:

Title here => http://example2.com

Title 2 here => http://example4.com

I tried writing some code:

soup.findAll("div", { "class" : "foo" })


But this returns a list of all the divs and their contents, and I don't know how to go further.

Thanks :)

Answer 1

fal*_*tru 9

Iterate over the divs and find the a tags inside each one.

from bs4 import BeautifulSoup

example = '''
<div class ="foo">
     <a href ="http://example.com"> </a>
     <a href ="http://example2.com"> Title here </a>
</div>

<div class ="foo">
     <a href ="http://example3.com"> </a>
     <a href ="http://example4.com"> Title 2 here </a>
</div>
'''

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(example, 'html.parser')
for div in soup.find_all('div', {'class': 'foo'}):
    a = div.find_all('a')[1]  # the second link inside this div
    print(a.text.strip(), '=>', a.attrs['href'])

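The loop above indexes `[1]` directly, which raises an IndexError if some "foo" div has fewer than two links. The question says the second link is always present, so this may never matter. As a minimal sketch of a more defensive variant, the helper below collects (text, href) pairs and skips such divs. The function name `second_links`, the `sample` markup, and the choice to skip rather than raise are assumptions made for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup


def second_links(html, css_class="foo"):
    """Return (text, href) pairs for the second <a> in each matching div.

    Hypothetical helper: divs with fewer than two links are skipped.
    """
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for div in soup.find_all("div", class_=css_class):
        links = div.find_all("a")
        if len(links) < 2:
            continue  # assumption: silently skip divs without a second link
        second = links[1]
        # get_text(strip=True) trims surrounding whitespace;
        # .get("href") returns None instead of raising if the attribute is missing.
        pairs.append((second.get_text(strip=True), second.get("href")))
    return pairs


# Illustrative markup: the second div has only one link and is skipped.
sample = """
<div class="foo">
  <a href="http://example.com"> </a>
  <a href="http://example2.com"> Title here </a>
</div>
<div class="foo">
  <a href="http://example3.com"> </a>
</div>
"""

for text, href in second_links(sample):
    print(text, "=>", href)
```

Returning a list of pairs instead of printing inside the loop keeps the extraction reusable, for example to write the results to a file or build a dict from them.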
