Cod*_*alk 4 python django parsing beautifulsoup html-parsing
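Can anyone help me traverse an HTML tree with Beautiful Soup?
I am trying to parse some HTML output, collect each value, and then insert it into a table named Tld in Python/Django.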
<div class="rc" data-hveid="53">
<h3 class="r">
<a href="https://billing.anapp.com/" onmousedown="return rwt(this,'','','','2','AFQjCNGqpb38ftdxRdYvKwOsUv5EOJAlpQ','m3fly0i1VLOK9NJkV55hAQ','0CDYQFjAB','','',event)">Billing: Portal Home</a>
</h3>
I only want to parse the value of the href attribute of <a>, so only this part:
https://billing.anapp.com/
of:
<a href="https://billing.anapp.com/" onmousedown="return rwt(this,'','','','2','AFQjCNGqpb38ftdxRdYvKwOsUv5EOJAlpQ','m3fly0i1VLOK9NJkV55hAQ','0CDYQFjAB','','',event)">Billing: Portal Home</a>
I currently have:
for url in urls:
    mb.open(url)
    beautifulSoupObj = BeautifulSoup(mb.response().read())
    beautifulSoupObj.find_all('h3',attrs={'class': 'r'})
The problem is that the find_all call above doesn't go far enough to reach the <a> element.
Any help is much appreciated. Thank you.
from bs4 import BeautifulSoup
html = """
<div class="rc" data-hveid="53">
<h3 class="r">
<a href="https://billing.anapp.com/" onmousedown="return rwt(this,'','','','2','AFQjCNGqpb38ftdxRdYvKwOsUv5EOJAlpQ','m3fly0i1VLOK9NJkV55hAQ','0CDYQFjAB','','',event)">Billing: Portal Home</a>
</h3>
"""
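# Name the parser explicitly; omitting it makes bs4 guess one and emit a warning
bs = BeautifulSoup(html, "html.parser")
# "h3.r a" matches every <a> nested inside an <h3 class="r">
elms = bs.select("h3.r a")
for i in elms:
    print(i.attrs["href"])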
prints:
https://billing.anapp.com/
h3.r a is a CSS selector.
You can use CSS selectors (I prefer them), XPath, or find on the elements. The selector h3.r a looks for every h3 with class r and returns the a elements inside them. It could be a more complicated example like #an_id table tr.the_tr_class td.the_td_class, which finds the td elements with class the_td_class that sit in a tr with class the_tr_class, inside a table, inside the element whose id is an_id.
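To make the longer selector concrete, here is a small sketch. The markup is made up purely for illustration (it is not from the question), and only one cell satisfies every part of the selector.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented only to exercise the longer selector.
html = """
<div id="an_id">
  <table>
    <tr class="the_tr_class">
      <td class="the_td_class">match</td>
      <td class="other">skipped: wrong td class</td>
    </tr>
    <tr class="other_row">
      <td class="the_td_class">skipped: wrong tr class</td>
    </tr>
  </table>
</div>
<table>
  <tr class="the_tr_class"><td class="the_td_class">skipped: outside #an_id</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Each space in the selector means "somewhere inside", at any depth.
cells = [td.get_text() for td in soup.select("#an_id table tr.the_tr_class td.the_td_class")]
print(cells)
```

Only the first cell is returned; the others each fail one part of the selector.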
The loop below gives the same result. find_all returns a list-like ResultSet of bs4.element.Tag objects, so you can call find_all again on each one. find_all also has a recursive argument; I'm not sure whether you can do it in one line. I personally prefer CSS selectors because they are easy and clean.
for elm in bs.find_all('h3',attrs={'class': 'r'}):
    for a_elm in elm.find_all("a"):
        print(a_elm.attrs["href"])
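On the one-line question: a list comprehension is one way to collect the href values in a single expression, with either approach. This is a sketch, not the only way to do it; the anchor tag is shortened here (the onmousedown attribute is dropped) to keep the example readable, and the names hrefs_css and hrefs_find are just illustrative.

```python
from bs4 import BeautifulSoup

# Same structure as the question's snippet, with the onmousedown attribute dropped.
html = """
<div class="rc" data-hveid="53">
<h3 class="r">
<a href="https://billing.anapp.com/">Billing: Portal Home</a>
</h3>
</div>
"""

bs = BeautifulSoup(html, "html.parser")

# CSS selector version; a[href] skips anchors that have no href attribute.
hrefs_css = [a["href"] for a in bs.select("h3.r a[href]")]

# find_all version; class_ filters on the CSS class, href=True requires the attribute.
hrefs_find = [a["href"] for h3 in bs.find_all("h3", class_="r") for a in h3.find_all("a", href=True)]

print(hrefs_css)
print(hrefs_find)
```

Both lists hold the single URL from the snippet. Requiring the href attribute up front avoids a KeyError on anchors that lack one.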