BeautifulSoup web scraping find_all(): finding an exact match

use*_*815 17 html python regex beautifulsoup web-scraping

I'm using Python and BeautifulSoup for web scraping.

Let's say I have the following HTML to scrape:

<body>
    <div class="product">Product 1</div>
    <div class="product">Product 2</div>
    <div class="product special">Product 3</div>
    <div class="product special">Product 4</div>
</body>

Using BeautifulSoup, I want to find only the products with the attribute class="product" (Products 1 and 2 only), not the 'special' products.

If I do the following:

result = soup.find_all('div', {'class': 'product'})

the result includes all of the products (1, 2, 3, and 4).

How can I find only the products whose class matches 'product' exactly?


The code I ran:

from bs4 import BeautifulSoup
import re

text = """
<body>
    <div class="product">Product 1</div>
    <div class="product">Product 2</div>
    <div class="product special">Product 3</div>
    <div class="product special">Product 4</div>
</body>"""

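# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(text, "html.parser")
result = soup.find_all(attrs={'class': re.compile(r"^product$")})
print(result)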

Output:

[<div class="product">Product 1</div>, <div class="product">Product 2</div>, <div class="product special">Product 3</div>, <div class="product special">Product 4</div>]

Mar*_*ers 38

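In BeautifulSoup 4, the class attribute (and several other attributes, such as accesskey and the headers attribute on table cell elements) is treated as a set; you match against the individual values listed in the attribute. This follows the HTML standard.

As such, you cannot limit the search to just one class with a plain attribute filter.

You'll have to use a custom function here to match against the class instead: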
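
To make the set behaviour concrete, here is a minimal sketch (the trimmed-down markup and the variable names are made up for illustration) showing how BeautifulSoup 4 exposes the class attribute as a list, and why a search for 'product' also picks up the 'special' tags:

```python
from bs4 import BeautifulSoup

html = """
<div class="product">Product 1</div>
<div class="product special">Product 3</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Each class attribute is parsed into a list of individual class names.
class_lists = [div["class"] for div in soup.find_all("div")]
print(class_lists)  # [['product'], ['product', 'special']]

# A search for 'product' matches any tag whose list contains that value.
matches = soup.find_all("div", class_="product")
print(len(matches))  # 2
```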

result = soup.find_all(lambda tag: tag.name == 'div' and 
                                   tag.get('class') == ['product'])

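I used a lambda to create an anonymous function; each tag is matched on its name (it must be 'div'), and the class attribute must be exactly equal to the list ['product'], i.e. have just that one value.

Demo: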

>>> from bs4 import BeautifulSoup
>>> text = """
... <body>
...     <div class="product">Product 1</div>
...     <div class="product">Product 2</div>
...     <div class="product special">Product 3</div>
...     <div class="product special">Product 4</div>
... </body>"""
>>> soup = BeautifulSoup(text)
>>> soup.find_all(lambda tag: tag.name == 'div' and tag.get('class') == ['product'])
[<div class="product">Product 1</div>, <div class="product">Product 2</div>]
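
If the same check is needed in several places, the lambda could be wrapped in a small reusable filter. The sketch below is one possible way to do it, not part of BeautifulSoup: the helper name has_exact_classes is hypothetical, and it compares sets rather than lists, so the order of the classes in the markup does not matter.

```python
from bs4 import BeautifulSoup


def has_exact_classes(tag_name, classes):
    """Build a filter for tags named tag_name whose class set equals classes."""
    wanted = set(classes)

    def matcher(tag):
        # tag.get() falls back to an empty list for tags without a class attribute
        return tag.name == tag_name and set(tag.get("class", [])) == wanted

    return matcher


html = """
<body>
    <div class="product">Product 1</div>
    <div class="product">Product 2</div>
    <div class="product special">Product 3</div>
    <div class="product special">Product 4</div>
</body>"""
soup = BeautifulSoup(html, "html.parser")

plain = soup.find_all(has_exact_classes("div", ["product"]))
special = soup.find_all(has_exact_classes("div", ["special", "product"]))

print([tag.get_text() for tag in plain])    # ['Product 1', 'Product 2']
print([tag.get_text() for tag in special])  # ['Product 3', 'Product 4']
```

Comparing sets is a design choice: use the list comparison from the lambda above if the class order should also have to match.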

For completeness' sake, here is the list of all such set attributes, from the BeautifulSoup source code:

# The HTML standard defines these attributes as containing a
# space-separated list of values, not a single value. That is,
# class="foo bar" means that the 'class' attribute has two values,
# 'foo' and 'bar', not the single value 'foo bar'.  When we
# encounter one of these attributes, we will parse its value into
# a list of values if possible. Upon output, the list will be
# converted back into a string.
cdata_list_attributes = {
    "*" : ['class', 'accesskey', 'dropzone'],
    "a" : ['rel', 'rev'],
    "link" :  ['rel', 'rev'],
    "td" : ["headers"],
    "th" : ["headers"],
    "td" : ["headers"],
    "form" : ["accept-charset"],
    "object" : ["archive"],

    # These are HTML5 specific, as are *.accesskey and *.dropzone above.
    "area" : ["rel"],
    "icon" : ["sizes"],
    "iframe" : ["sandbox"],
    "output" : ["for"],
    }
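
As a quick illustration of the table above, the following sketch (with a made-up link element) contrasts a multi-valued attribute, rel on an a tag, with a single-valued one, id, which stays a plain string even when it contains a space:

```python
from bs4 import BeautifulSoup

html = '<a id="home link" rel="nofollow noopener" href="/">Home</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

rel_value = link["rel"]  # listed for "a" above, so it is split into a list
id_value = link["id"]    # not a list attribute, so it is kept as one string

print(rel_value)  # ['nofollow', 'noopener']
print(id_value)   # home link
```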

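  • Finally a solution that works!! I had two classes to match and was using `soup.find_all('div', {'class': ['class1','class2']})`, but it was also picking up `div`s with only `class2`. This does what I expected. Not sure why the one I was using didn't work... (2 upvotes)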
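
The behaviour described in the comment above is most likely explained by how list filters work: when a list is passed as an attribute value, BeautifulSoup matches a tag if any one of the listed values matches. A minimal sketch with made-up markup:

```python
from bs4 import BeautifulSoup

html = """
<div class="class1 class2">both</div>
<div class="class2">only class2</div>
<div class="class1">only class1</div>
<div class="other">neither</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A list of values means "match any of these", not "match all of these".
any_match = [d.get_text() for d in soup.find_all("div", {"class": ["class1", "class2"]})]
print(any_match)  # ['both', 'only class2', 'only class1']

# A function filter can require exactly the two classes instead.
both_only = [
    d.get_text()
    for d in soup.find_all(
        lambda tag: tag.name == "div"
        and set(tag.get("class", [])) == {"class1", "class2"}
    )
]
print(both_only)  # ['both']
```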

cru*_*nch 5

You can use CSS selectors like this:

result = soup.select('div.product.special')

CSS selectors
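
Note that 'div.product.special' selects the tags carrying both classes (Products 3 and 4), which is the opposite of what the question asks for. The sketch below shows two selectors that could be used for the plain products instead. It assumes a BeautifulSoup version whose select() is backed by Soup Sieve (4.7 or later), which is needed for :not().

```python
from bs4 import BeautifulSoup

html = """
<body>
    <div class="product">Product 1</div>
    <div class="product">Product 2</div>
    <div class="product special">Product 3</div>
    <div class="product special">Product 4</div>
</body>"""
soup = BeautifulSoup(html, "html.parser")

# Attribute selector: the whole class attribute must be the string "product".
exact = [d.get_text() for d in soup.select('div[class="product"]')]

# Negation: has the class 'product' but not the class 'special'.
not_special = [d.get_text() for d in soup.select("div.product:not(.special)")]

# The selector from the answer: requires both classes.
special = [d.get_text() for d in soup.select("div.product.special")]

print(exact)        # ['Product 1', 'Product 2']
print(not_special)  # ['Product 1', 'Product 2']
print(special)      # ['Product 3', 'Product 4']
```

The two variants are not equivalent in general: the attribute selector rejects any additional class, while :not(.special) only rules out that one class.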