Raj*_*eev 14 python beautifulsoup
Using the Beautiful Soup module, how can I get the data of the div tag whose class name is feeditemcontent cxfeeditemcontent? Is it:
soup.class['feeditemcontent cxfeeditemcontent']
Or:
soup.find_all('class')
Here is the HTML source:
<div class="feeditemcontent cxfeeditemcontent">
  <div class="feeditembodyandfooter">
    <div class="feeditembody">
      <span>The actual data is some where here</span>
    </div>
  </div>
</div>
Here is the Python code:
from BeautifulSoup import BeautifulSoup
html_doc = open('home.jsp.html', 'r')
soup = BeautifulSoup(html_doc)
class="feeditemcontent cxfeeditemcontent"
Leo*_*son 22
Beautiful Soup 4 treats the value of the "class" attribute as a list rather than a string, which means jadkik94's solution can be simplified:
from bs4 import BeautifulSoup

def match_class(target):
    # Return a filter that accepts tags carrying every class in target.
    def do_match(tag):
        classes = tag.get('class', [])
        return all(c in classes for c in target)
    return do_match

# html holds the HTML source from the question
soup = BeautifulSoup(html, "html.parser")
print(soup.find_all(match_class(["feeditemcontent", "cxfeeditemcontent"])))
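To make the list behaviour concrete, here is a minimal sketch, assuming Beautiful Soup 4 with the built-in "html.parser" backend. The HTML is the snippet from the question; the names outer, classes, wanted and has_all are only illustrative.

```python
from bs4 import BeautifulSoup

html = """<div class="feeditemcontent cxfeeditemcontent">
  <div class="feeditembodyandfooter">
    <div class="feeditembody">
      <span>The actual data is some where here</span>
    </div>
  </div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div")  # first div in the document

# In Beautiful Soup 4 the class attribute is exposed as a list of names.
classes = outer.get("class", [])
print(classes)

# A subset test expresses "has all of these classes", regardless of order.
wanted = {"cxfeeditemcontent", "feeditemcontent"}
has_all = wanted.issubset(classes)
print(has_all)
```

Because the check is a subset test, it should not matter in which order the classes appear in the markup.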
jad*_*k94 10
Try this. It may be overkill for something this simple, but it works:
def match_class(target):
    target = target.split()
    def do_match(tag):
        try:
            classes = dict(tag.attrs)["class"]
        except KeyError:
            classes = ""
        classes = classes.split()
        return all(c in classes for c in target)
    return do_match

html = """<div class="feeditemcontent cxfeeditemcontent">
<div class="feeditembodyandfooter">
<div class="feeditembody">
<span>The actual data is some where here</span>
</div>
</div>
</div>"""

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html)

matches = soup.findAll(match_class("feeditemcontent cxfeeditemcontent"))
for m in matches:
    print m
    print "-"*10

matches = soup.findAll(match_class("feeditembody"))
for m in matches:
    print m
    print "-"*10
soup.findAll("div", class_="feeditemcontent cxfeeditemcontent")
So, if I want to get all div tags with the class header (<div class="header">) from stackoverflow.com, an example with BeautifulSoup would be:
from bs4 import BeautifulSoup as bs
import requests
url = "http://stackoverflow.com/"
html = requests.get(url).text
soup = bs(html, "html.parser")  # name the parser explicitly to avoid the bs4 warning
tags = soup.findAll("div", class_="header")
This is already covered in the bs4 documentation.
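One caveat worth illustrating: as far as the bs4 documentation describes it, passing a multi-class string to class_ matches the exact attribute string, so a different class order does not match, whereas a CSS selector matches on the set of classes. A small sketch, assuming Beautiful Soup 4 (with its soupsieve dependency for select()) and a made-up three-div document:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: same two classes in both orders, plus a single-class div.
html = """
<div class="feeditemcontent cxfeeditemcontent">first</div>
<div class="cxfeeditemcontent feeditemcontent">second</div>
<div class="feeditemcontent">third</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Multi-class string: compared against the exact class attribute value.
exact = soup.find_all("div", class_="feeditemcontent cxfeeditemcontent")

# CSS selector: requires both classes, in any order.
both = soup.select("div.feeditemcontent.cxfeeditemcontent")

# Single class name: matches any div that carries that class.
single = soup.find_all("div", class_="feeditemcontent")

for label, tags in (("exact", exact), ("both", both), ("single", single)):
    print(label, [t.get_text() for t in tags])
```

If class order in the real page is not guaranteed, the selector form is probably the safer of the two.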
小智 5
from BeautifulSoup import BeautifulSoup
f = open('a.htm')
soup = BeautifulSoup(f)
matches = soup.findAll('div', attrs={'id':'abc def'})  # renamed from list to avoid shadowing the builtin
print matches
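The attrs dictionary form can also be pointed at class instead of id. The following is a rough sketch under the assumption that Beautiful Soup 4 is in use (the snippet above targets the older BeautifulSoup 3 package); it reuses the HTML from the question and then pulls out the span text the question asks about. The names container, span and text are illustrative.

```python
from bs4 import BeautifulSoup

html = """<div class="feeditemcontent cxfeeditemcontent">
  <div class="feeditembodyandfooter">
    <div class="feeditembody">
      <span>The actual data is some where here</span>
    </div>
  </div>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# attrs-style lookup on class; a single class name matches any tag carrying it.
container = soup.find("div", attrs={"class": "feeditemcontent"})

text = None
if container is not None:
    # find() searches all descendants, so the nested span is reachable directly.
    span = container.find("span")
    if span is not None:
        text = span.get_text(strip=True)

print(text)
```

The None checks are there because find() returns None when nothing matches, which would otherwise raise an AttributeError on the next call.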