How does BeautifulSoup's findAll work?

Question


par*_*cer 3 html python beautifulsoup html-parsing

I've noticed some strange behavior in the findAll method:

>>> htmls="<html><body><p class=\"pagination-container\">slytherin</p><p class=\"pagination-container and something\">gryffindor</p></body></html>"
>>> soup=BeautifulSoup(htmls, "html.parser")
>>> for i in soup.findAll("p",{"class":"pagination-container"}):
    print(i.text)


slytherin
gryffindor
>>> for i in soup.findAll("p", {"class":"pag"}):
    print(i.text)


>>> for i in soup.findAll("p",{"class":"pagination-container"}):
    print(i.text)


slytherin
gryffindor
>>> for i in soup.findAll("p",{"class":"pagination"}):
    print(i.text)


>>> len(soup.findAll("p",{"class":"pagination-container"}))
2
>>> len(soup.findAll("p",{"class":"pagination-containe"}))
0
>>> len(soup.findAll("p",{"class":"pagination-contai"}))
0
>>> len(soup.findAll("p",{"class":"pagination-container and something"}))
1
>>> len(soup.findAll("p",{"class":"pagination-conta"}))
0


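So when we search for pagination-container, it returns both the first and the second p tag. That made me think it checks for partial equality, something like if passed_string in class_attribute_value:. So I shortened the string passed to findAll, and it never found anything!

How is this possible?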

Answer 1

ale*_*cxe 5

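First of all, class is a special multi-valued, space-delimited attribute, and it gets special handling.

When you write soup.findAll("p", {"class":"pag"}), BeautifulSoup searches for elements that have the class pag. It splits the element's class value on whitespace and checks whether pag is among the resulting items. An element with class="test pag" or class="pag" would match.

Note that in the case of soup.findAll("p", {"class": "pagination-container and something"}), BeautifulSoup matches the element whose class attribute value is exactly that string. No splitting is involved in this case: it simply sees that there is an element whose complete class value equals the requested string.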
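
The two rules above can be modeled in a few lines of plain Python. This is only a simplified sketch of the behavior described here, not BeautifulSoup's actual implementation; the helper name class_matches is hypothetical.

```python
def class_matches(attr_value: str, wanted: str) -> bool:
    """Simplified model of matching a plain string against a class attribute.

    Hypothetical helper for illustration only; not part of BeautifulSoup.
    """
    tokens = attr_value.split()  # class is a whitespace-separated list
    # Rule 1: the string equals one whole class token.
    # Rule 2: the string equals the complete attribute value.
    return wanted in tokens or wanted == attr_value


second = "pagination-container and something"

print(class_matches(second, "pagination-container"))                # True
print(class_matches(second, "pagination-container and something"))  # True
print(class_matches(second, "pag"))                                 # False
print("pag" in second)  # True: the substring test the question assumed
```

Under this model a shortened string such as pag or pagination-containe is never a whole token and never the whole value, which is why the searches in the question return nothing.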

To do a partial match on one of the classes, you can provide a regular expression or a function as the class filter value:

import re

soup.find_all("p", {"class": re.compile(r"pag")})  # contains pag
soup.find_all("p", {"class": re.compile(r"^pag")})  # starts with pag

soup.find_all("p", {"class": lambda class_: class_ and "pag" in class_})  # contains pag
soup.find_all("p", {"class": lambda class_: class_ and class_.startswith("pag")})  # starts with pag
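
As a rough check, the same kind of filters can be run against the markup from the question. The sketch below assumes bs4 is installed; the texts helper and the variable names are made up for illustration.

```python
import re

from bs4 import BeautifulSoup

# The two paragraphs from the question.
html = (
    "<html><body>"
    '<p class="pagination-container">slytherin</p>'
    '<p class="pagination-container and something">gryffindor</p>'
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")


def texts(tags):
    """Hypothetical helper: collect the text of each matched tag."""
    return [tag.get_text() for tag in tags]


# Regex filter: some class starts with "pag" -> both paragraphs.
starts_with_pag = texts(soup.find_all("p", {"class": re.compile(r"^pag")}))

# Regex filter: some class ends with "thing" -> only the second paragraph.
ends_with_thing = texts(soup.find_all("p", {"class": re.compile(r"thing$")}))

# Function filter: the value is None for tags that have no class attribute,
# so guard against that before calling str methods.
by_function = texts(
    soup.find_all("p", {"class": lambda c: c is not None and c.startswith("pag")})
)

print(starts_with_pag)  # ['slytherin', 'gryffindor']
print(ends_with_thing)  # ['gryffindor']
print(by_function)      # ['slytherin', 'gryffindor']
```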


There is more to say, but you should also know that BeautifulSoup has CSS selector support (limited, but it covers most common use cases). You can write something like:

soup.select("p.pagination-container")  # one of the classes is "pagination-container"
soup.select("p[class='pagination-container']")  # match the COMPLETE class attribute value
soup.select("p[class^=pag]")  # COMPLETE class attribute value starts with pag
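
To see how these selectors differ on the question's markup, here is a small sketch. It assumes a bs4 version with working select() support; the texts helper is hypothetical, and the [class~=...] selector is standard CSS that was not part of the list above.

```python
from bs4 import BeautifulSoup

# The two paragraphs from the question.
html = (
    "<html><body>"
    '<p class="pagination-container">slytherin</p>'
    '<p class="pagination-container and something">gryffindor</p>'
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")


def texts(tags):
    """Hypothetical helper: collect the text of each matched tag."""
    return [tag.get_text() for tag in tags]


# Class selector: one of the classes is "pagination-container".
has_class = texts(soup.select("p.pagination-container"))

# Attribute selector: the COMPLETE class value must equal the string.
exact_attr = texts(soup.select("p[class='pagination-container']"))

# Attribute selector: the COMPLETE class value starts with "pag".
prefix_attr = texts(soup.select("p[class^=pag]"))

# Attribute selector: one whitespace-separated token equals "something".
token_attr = texts(soup.select("p[class~=something]"))

print(has_class)    # ['slytherin', 'gryffindor']
print(exact_attr)   # ['slytherin']
print(prefix_attr)  # ['slytherin', 'gryffindor']
print(token_attr)   # ['gryffindor']
```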


How BeautifulSoup handles class attribute values is a common source of confusion and questions, and related threads cover it in more depth.
