Scraping elements without an id or class from a web page using Python BeautifulSoup

Question

I know how to scrape data from a web page when the element has an id or a class.

For example, where soup is a BeautifulSoup object:

for item in soup.find_all('a', {"class": "class_name"}):
    # get_text() always returns a str; item.string can be None for tags with nested children
    title = item.get_text()
    print(title + "\n")

How can we do this when the element has no id or class? For example, a paragraph element without an id or class.

Or, in an even worse case, what if we only need to scrape some plain text like the following?

<body>
<p>YO!</p>
hello world!!
</body>

For example, how can I print only hello world!! from the page source above? It has no id or class.

Answer 1

ale*_*cxe 5

To locate an element that has neither an id nor a class attribute defined:

soup.find("p", class_=False, id=False)
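
As a minimal sketch of how these filters behave, the snippet below runs the same call against a small made-up page. The markup and the variable names (html, plain, all_plain) are assumptions for illustration and do not come from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this illustration.
html = """
<body>
<p id="intro">first</p>
<p class="note">second</p>
<p>third</p>
</body>
"""

soup = BeautifulSoup(html, "html.parser")

# class_=False and id=False keep only tags that lack those attributes.
plain = soup.find("p", class_=False, id=False)
print(plain.get_text())

# find_all accepts the same filters and returns every matching tag.
all_plain = soup.find_all("p", class_=False, id=False)
print(len(all_plain))
```

Only the third paragraph carries neither attribute, so it is the one returned by both calls.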

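To locate a "text" node such as hello world!! in your example, you can find it by the text itself, through an exact match, a partial match, or a regular expression: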

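import re

# string= is the current name of the older text= argument.
# In the sample page the text node includes the surrounding newlines,
# so anchored and exact matches have to allow for that whitespace.
soup.find(string=re.compile(r"^\s*hello"))  # text starting with "hello", ignoring leading whitespace
soup.find(string="hello world!!")  # matches only a node that is exactly "hello world!!"
soup.find(string=lambda s: s and "!!" in s)  # text containing "!!"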
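
To see what these lookups return on the page from the question, here is a small self-contained sketch. It assumes the built-in html.parser and reproduces the question's markup as a string; the variable names are illustrative. With this parser the text node keeps its surrounding newlines, which is why the exact-string lookup finds nothing here and the result is stripped before use.

```python
import re
from bs4 import BeautifulSoup

# The page from the question, written out as a string.
html = "<body>\n<p>YO!</p>\nhello world!!\n</body>"
soup = BeautifulSoup(html, "html.parser")

# A regex is matched with search(), so an unanchored pattern finds the node
# even though it begins with a newline.
node = soup.find(string=re.compile(r"hello"))
print(repr(node))
print(node.strip())

# An exact string must equal the whole node, whitespace included.
exact = soup.find(string="hello world!!")
print(exact)

# A function can strip the node before comparing.
stripped = soup.find(string=lambda s: s and s.strip() == "hello world!!")
print(stripped.strip())
```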

Alternatively, you can find the preceding p element and get the next text node:

# string= is the current name of the older text= argument
soup.find("p", class_=False, id=False).find_next_sibling(string=True)
soup.find("p", string="YO!").find_next_sibling(string=True)
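
The sibling approach can be sketched end to end on the question's page as follows. This again assumes html.parser, and the names p and text_node are illustrative.

```python
from bs4 import BeautifulSoup

# The page from the question, written out as a string.
html = "<body>\n<p>YO!</p>\nhello world!!\n</body>"
soup = BeautifulSoup(html, "html.parser")

# Find the paragraph by its text, then step to the next sibling that is a text node.
p = soup.find("p", string="YO!")
text_node = p.find_next_sibling(string=True)

# The node keeps the surrounding newlines, so strip it before printing.
print(text_node.strip())
```

This relies on the wanted text directly following the p element; if another tag sat in between, find_next_sibling(string=True) would still skip over it and return the next text node at the same level.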
