Can BeautifulSoup find non-standard HTML tags/attributes?

Question

At my job we use tags that we created ourselves. One of them is called can-edit, and in the code it looks like this (for example):

<h1 can-edit="banner top text" class="mainText">some text</h1>
<h2 can-edit="banner bottom text" class="bottomText">some text</h2>

It can appear inside any tag (img, p, h1, h2, div ...).

What I would like to get is every can-edit value in the page. For the HTML above, that would be:

['banner top text', 'banner bottom text']

I tried

soup = BeautifulSoup(html, "html.parser")
can_edits = soup.find_all("can-edit")

but it doesn't find anything.

Answer 1

Wil*_*sem 6

I tried
soup = BeautifulSoup(html, "html.parser")
can_edits = soup.find_all("can-edit")
but it doesn't find anything.

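This doesn't work because a string passed to find_all is matched against the tag name, so here you are searching for a tag named can-edit, i.e. <can-edit ...>. Your HTML contains no such tag; can-edit is an attribute on other tags.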
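
To make the name-versus-attribute distinction concrete, here is a minimal sketch. The <can-edit> element in it is hypothetical and is not part of the question's HTML; it is added only to show what a name search actually matches.

```python
from bs4 import BeautifulSoup

# The <can-edit> element is hypothetical, added only for illustration.
html = """<h1 can-edit="banner top text" class="mainText">some text</h1>
<can-edit>a hypothetical element</can-edit>"""

soup = BeautifulSoup(html, "html.parser")

# A string argument is compared against tag names, not attribute names.
matches = soup.find_all("can-edit")
matched_names = [tag.name for tag in matches]

print(matched_names)  # ['can-edit']
```

The h1 is not returned even though it carries a can-edit attribute, which is why the original attempt came back empty.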

You can use the soup's find_all function to find all tags that have a particular attribute. For example:

soup.find_all(attrs={'can-edit': True})

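So here we use the attrs parameter and pass it a dictionary saying that we want tags that have a can-edit attribute. This gives us a list of the tags with a can-edit attribute (regardless of its value). If we now want the value of that attribute, we can look up the ['can-edit'] item of each tag, so we can write a list comprehension: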

all_can_edit_attrs = [tag['can-edit']
                      for tag in soup.find_all(attrs={'can-edit': True})]

Or as a full working version:

from bs4 import BeautifulSoup

s = """<h1 can-edit="banner top text" class="mainText">some text</h1>
<h2 can-edit="banner bottom text" class="bottomText">some text</h2>"""

soup = BeautifulSoup(s, 'lxml')  # 'html.parser' also works if lxml is not installed

all_can_edit_attrs = [tag['can-edit']
                      for tag in soup.find_all(attrs={'can-edit': True})]
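
As an alternative sketch (not part of the approach above), the same lookup can probably be written with a CSS attribute-presence selector through select. The extra <p> element is an assumption added here to show that tags without the attribute are skipped.

```python
from bs4 import BeautifulSoup

s = """<h1 can-edit="banner top text" class="mainText">some text</h1>
<h2 can-edit="banner bottom text" class="bottomText">some text</h2>
<p class="plain">no can-edit here</p>"""  # the <p> is an added assumption

soup = BeautifulSoup(s, "html.parser")

# "[can-edit]" selects every tag that has a can-edit attribute, whatever its value.
values = [tag["can-edit"] for tag in soup.select("[can-edit]")]

print(values)  # ['banner top text', 'banner bottom text']
```

Both forms return the tags in document order, so which one to use is mostly a matter of taste.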
