Parsing the href attribute value from an element using BeautifulSoup and Mechanize

Question

Cod*_*alk 4 python django parsing beautifulsoup html-parsing

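Can anyone help me traverse an HTML tree with Beautiful Soup?

I am trying to parse HTML output, collect each value, and then insert it into a table named Tld using Python/Django.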

<div class="rc" data-hveid="53">
<h3 class="r">
<a href="https://billing.anapp.com/" onmousedown="return rwt(this,'','','','2','AFQjCNGqpb38ftdxRdYvKwOsUv5EOJAlpQ','m3fly0i1VLOK9NJkV55hAQ','0CDYQFjAB','','',event)">Billing: Portal Home</a>
</h3>


I only want to extract the value of the href attribute of the <a> element, so only this part:

https://billing.anapp.com/


from:

<a href="https://billing.anapp.com/" onmousedown="return rwt(this,'','','','2','AFQjCNGqpb38ftdxRdYvKwOsUv5EOJAlpQ','m3fly0i1VLOK9NJkV55hAQ','0CDYQFjAB','','',event)">Billing: Portal Home</a>


I currently have:

for url in urls:
    mb.open(url)
    beautifulSoupObj = BeautifulSoup(mb.response().read())
    beautifulSoupObj.find_all('h3',attrs={'class': 'r'})


The problem is that the find_all call above does not reach far enough down to get to the <a> element.

Any help is much appreciated. Thank you.

Answer 1

Foo*_*ser 6

from bs4 import BeautifulSoup

html = """
<div class="rc" data-hveid="53">
<h3 class="r">
<a href="https://billing.anapp.com/" onmousedown="return rwt(this,'','','','2','AFQjCNGqpb38ftdxRdYvKwOsUv5EOJAlpQ','m3fly0i1VLOK9NJkV55hAQ','0CDYQFjAB','','',event)">Billing: Portal Home</a>
</h3>
"""

# Name the parser explicitly; without it bs4 picks one itself and emits a warning.
bs = BeautifulSoup(html, "html.parser")
elms = bs.select("h3.r a")
for i in elms:
    print(i.attrs["href"])


prints:

https://billing.anapp.com/
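
To connect this back to the loop in the question, here is a minimal sketch that wraps the same selector in a helper. The name extract_hrefs and the sample string are made up for illustration; mb and urls are the asker's own objects and appear only in a comment.

```python
from bs4 import BeautifulSoup


def extract_hrefs(markup):
    """Return the href of every <a> found inside an <h3 class="r">."""
    soup = BeautifulSoup(markup, "html.parser")
    # a[href] skips anchors that have no href attribute at all
    return [a["href"] for a in soup.select("h3.r a[href]")]


# Shortened copy of the markup from the question
sample = (
    '<div class="rc"><h3 class="r">'
    '<a href="https://billing.anapp.com/">Billing: Portal Home</a>'
    '</h3></div>'
)
print(extract_hrefs(sample))

# In the question's loop this would be used roughly as follows
# (mb is assumed to be the asker's mechanize browser):
#     for url in urls:
#         mb.open(url)
#         hrefs = extract_hrefs(mb.response().read())
```

Returning a plain list of strings keeps the parsing separate from whatever is done with the values afterwards, such as saving them to the Tld table.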


h3.r a is a CSS selector.

You can use CSS selectors (I prefer them), XPath (through lxml, since Beautiful Soup itself does not support XPath), or find/find_all on the elements. The selector h3.r a looks for every h3 with class r and returns the a elements inside them. A selector can be more complicated, for example #an_id table tr.the_tr_class td.the_td_class, which finds the td elements with class the_td_class that sit inside a tr with class the_tr_class, inside a table, inside the element whose id is an_id.
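
As a rough illustration of the longer selector, the sketch below runs it against some hypothetical markup. The HTML is invented purely to fit the selector and does not come from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up to match the selector described above.
html = """
<div id="an_id">
  <table>
    <tr class="the_tr_class">
      <td class="the_td_class">match</td>
      <td class="other">skipped: wrong td class</td>
    </tr>
    <tr class="other_row">
      <td class="the_td_class">skipped: wrong tr class</td>
    </tr>
  </table>
</div>
<table>
  <tr class="the_tr_class"><td class="the_td_class">skipped: outside #an_id</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Each space in the selector means "somewhere inside", at any depth.
cells = soup.select("#an_id table tr.the_tr_class td.the_td_class")
texts = [td.get_text() for td in cells]
print(texts)
```

Only the first cell satisfies every part of the selector; the other cells fail on the td class, the tr class, or the enclosing id.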

The following also gives the same result. find_all returns a list of bs4.element.Tag objects. find_all has a recursive argument, but I am not sure whether you can do this in one line with it. Personally I prefer CSS selectors because they are easy and clean.

for elm in bs.find_all('h3', attrs={'class': 'r'}):
    for a_elm in elm.find_all("a"):
        print(a_elm.attrs["href"])
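
On the one-line question, one possible way is to fold the two loops into a list comprehension. This is only a sketch: the markup is an invented variation on the question's HTML, with two extra headings added to show what gets filtered out.

```python
from bs4 import BeautifulSoup

# Invented variation on the question's markup.
html = """
<h3 class="r"><a href="https://billing.anapp.com/">Billing: Portal Home</a></h3>
<h3 class="r"><a name="no-href">Anchor without href</a></h3>
<h3 class="other"><a href="https://example.com/">Not under h3.r</a></h3>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ is bs4's keyword for matching the class attribute;
# href=True keeps only <a> tags that actually carry an href.
hrefs = [a["href"]
         for h3 in soup.find_all("h3", class_="r")
         for a in h3.find_all("a", href=True)]
print(hrefs)
```

Filtering with href=True avoids a KeyError on anchors that have no href. An alternative is a_elm.get("href"), which returns None for such anchors instead of raising.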

