Getting content by class name with Beautiful Soup

Raj*_*eev 14 python beautifulsoup

Using the Beautiful Soup module, how can I get the data of the div tag whose class name is feeditemcontent cxfeeditemcontent? Is it:

soup.class['feeditemcontent cxfeeditemcontent']

Or:

soup.find_all('class')

Here is the HTML source:

<div class="feeditemcontent cxfeeditemcontent">
    <div class="feeditembodyandfooter">
         <div class="feeditembody">
         <span>The actual data is some where here</span>
         </div>
     </div>
 </div> 

Here is the Python code:

 from BeautifulSoup import BeautifulSoup
 html_doc = open('home.jsp.html', 'r')

 soup = BeautifulSoup(html_doc)
 class="feeditemcontent cxfeeditemcontent"

Leo*_*son 22

Beautiful Soup 4 treats the value of the "class" attribute as a list rather than a string, which means jadkik94's solution can be simplified:

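from bs4 import BeautifulSoup

def match_class(target):
    """Build a filter that matches tags carrying every class in target."""
    def do_match(tag):
        # bs4 returns the class attribute as a list of class names
        classes = tag.get('class', [])
        return all(c in classes for c in target)
    return do_match

# Markup from the question, trimmed to the relevant tags
html = '<div class="feeditemcontent cxfeeditemcontent"><span>The actual data is some where here</span></div>'

soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all(match_class(["feeditemcontent", "cxfeeditemcontent"])))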
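
To make the "list rather than a string" point concrete, here is a minimal sketch, assuming Python 3 with the bs4 package installed and using the markup from the question. The variable names `div`, `class_names` and `text` are only illustrative. It looks the outer div up by one of its classes, shows what the class attribute looks like, and pulls the text out of the nested span:

```python
from bs4 import BeautifulSoup

# Markup copied from the question
html = """<div class="feeditemcontent cxfeeditemcontent">
  <div class="feeditembodyandfooter">
    <div class="feeditembody">
      <span>The actual data is some where here</span>
    </div>
  </div>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# Searching for a single class name matches tags that carry it among others
div = soup.find("div", class_="feeditemcontent")

# The multi-valued class attribute comes back as a list of class names
class_names = div["class"]

# The text the question is after sits in the nested span
text = div.find("span").get_text(strip=True)

print(class_names)
print(text)
```

Because each class name is a separate list item, a membership test such as `c in classes` works directly, with no need to split a string first.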


jad*_*k94 10

Try this. It may be overkill for something this simple, but it works:

def match_class(target):
    target = target.split()
    def do_match(tag):
        try:
            classes = dict(tag.attrs)["class"]
        except KeyError:
            classes = ""
        classes = classes.split()
        return all(c in classes for c in target)
    return do_match

html = """<div class="feeditemcontent cxfeeditemcontent">
<div class="feeditembodyandfooter">
<div class="feeditembody">
<span>The actual data is some where here</span>
</div>
</div>
</div>"""

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html)

matches = soup.findAll(match_class("feeditemcontent cxfeeditemcontent"))
for m in matches:
    print m
    print "-"*10

matches = soup.findAll(match_class("feeditembody"))
for m in matches:
    print m
    print "-"*10

  • `classes = dict(tag.attrs).get('class', '')` is much shorter than the `try`/`except` block, and it does the same thing. (4 upvotes)
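
A quick way to check that the two spellings agree is the sketch below. It assumes that `tag.attrs` is a list of (name, value) pairs, as in Beautiful Soup 3, and stands in plain lists for real tags so it runs without the library. The helper names `classes_try` and `classes_get` and the sample attribute lists are made up for illustration:

```python
def classes_try(attrs):
    """Look up the class attribute with try/except, as in the answer."""
    try:
        classes = dict(attrs)["class"]
    except KeyError:
        classes = ""
    return classes.split()


def classes_get(attrs):
    """Same lookup using dict.get with an empty-string default."""
    return dict(attrs).get("class", "").split()


# Hypothetical attribute lists standing in for tag.attrs
with_class = [("class", "feeditemcontent cxfeeditemcontent"), ("id", "x")]
without_class = [("id", "x")]

for attrs in (with_class, without_class):
    print(classes_try(attrs), classes_get(attrs))
```

Both helpers return the split class names when the attribute is present and an empty list when it is missing.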

Azi*_*lto 6

soup.findAll("div", class_="feeditemcontent cxfeeditemcontent")

So, if I wanted to get all div tags with the class header (<div class="header">) from stackoverflow.com, an example with BeautifulSoup would be something like:

from bs4 import BeautifulSoup as bs
import requests

url = "http://stackoverflow.com/"
html = requests.get(url).text
# Name the parser explicitly so bs4 does not have to guess one
soup = bs(html, "html.parser")

tags = soup.findAll("div", class_="header")

This is already covered in the bs4 documentation.
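
One caveat with passing two class names in a single `class_` string: as far as I can tell from the bs4 documentation, that form matches the exact string value of the attribute, so the order of the names matters. The sketch below, assuming Python 3 with bs4 installed and a trimmed copy of the question's markup, compares that form with a single-class search and with a CSS selector. The result variable names are only illustrative:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question
html = ('<div class="feeditemcontent cxfeeditemcontent">'
        '<span>The actual data is some where here</span></div>')
soup = BeautifulSoup(html, "html.parser")

# Exact string value of the class attribute: matches
exact = soup.find_all("div", class_="feeditemcontent cxfeeditemcontent")

# Same two classes in the other order: does not match the exact string
swapped = soup.find_all("div", class_="cxfeeditemcontent feeditemcontent")

# A single class name matches regardless of the tag's other classes
single = soup.find_all("div", class_="cxfeeditemcontent")

# A CSS selector requires both classes and ignores their order
by_selector = soup.select("div.cxfeeditemcontent.feeditemcontent")

print(len(exact), len(swapped), len(single), len(by_selector))
```

If the order of the classes in the page is not guaranteed, the `select` form is probably the safer choice.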


小智 5

from BeautifulSoup import BeautifulSoup
f = open('a.htm')
soup = BeautifulSoup(f)
# Avoid naming the result `list`, which would shadow the built-in
divs = soup.findAll('div', attrs={'id':'abc def'})
print divs