Find all end nodes that contain text using BeautifulSoup4

Ank*_*kur 5 python beautifulsoup python-3.x

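I'm new to Python and BeautifulSoup4.

I'm trying to extract (only) the text content of all "div", "p", and "li" tags, and only from the tag's own direct node, not from its child nodes. Hence the two options text=True, recursive=False.

These are my attempts: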

content = soup.find_all("b", "div", "p", text=True, recursive=False)

tags = ["div", "p", "li"]
content = soup.find_all(tags, text=True, recursive=False)

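Neither of these gives me any output. Do you know what I'm doing wrong?

Edit: adding more code and the sample document I'm testing against. print(content) is empty.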

import requests
from bs4 import BeautifulSoup

url = "https://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-list"
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})

soup = BeautifulSoup(response.text, "html.parser")

tags = ["div", "p", "li"]
content = soup.find_all(tags, text=True, recursive=False)

print(content)
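
For reference, here is a minimal offline sketch of how `recursive=False` behaves when `find_all` is called on the soup object itself. The inline HTML string is a hypothetical stand-in for the downloaded page, and `text=True` is left out to isolate the effect of `recursive`. With `recursive=False`, only the direct children of `soup` are examined. Here that is just the `<html>` element, so none of the requested tag names can match.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (not the real document).
html = "<html><body><div><p>Hello</p><ul><li>One</li></ul></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

tags = ["div", "p", "li"]

# Direct children of the soup object: only the top-level <html> element.
direct_children = [child.name for child in soup.find_all(recursive=False)]

# recursive=False restricts the search to those direct children,
# so no <div>, <p>, or <li> is ever examined.
shallow = soup.find_all(tags, recursive=False)

# The default (recursive=True) walks all descendants.
deep = soup.find_all(tags)

print(direct_children)
print(shallow)
print([tag.name for tag in deep])
```

So `recursive=False` limits where the search looks; it does not mean "take only the tag's own text".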

Bit*_*han 4

From your question and your comments on the earlier answer, I think you are trying to find

  • the innermost tags
  • that are 'p', 'li', or 'div'
  • and that contain some text

import requests
from bs4 import BeautifulSoup
from bs4 import NavigableString

url = "https://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-list"
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})

soup = BeautifulSoup(response.text, "html.parser")

def end_node(tag):
    if tag.name not in ["div", "p", "li"]:
        return False
    if isinstance(tag, NavigableString):  # if str return
        return False
    if not tag.text:  # if no text return false
        return False
    elif len(tag.find_all(text=False)) > 0:  # no other tags inside other than text
        return False
    return True  # if valid it reaches here

content = soup.find_all(end_node)
print(content)  # all end nodes matching our criteria

Sample output

[<p>These instructions illustrate all major features of Beautiful Soup 4,
with examples. I show you what the library is good for, how it works,
how to use it, how to make it do what you want, and what to do when it
violates your expectations.</p>, <p>The examples in this documentation should work the same way in Python
2.7 and Python 3.2.</p>, <p>This documentation has been translated into other languages by
Beautiful Soup users:</p>, <p>Here are some simple ways to navigate that data structure:</p>, <p>One common task is extracting all the URLs found within a page’s &lt;a&gt; tags:</p>, <p>Another common task is extracting all the text from a page:</p>, <p>Does this look like what you need? If so, read on.</p>, <p>If you’re using a recent version of Debian or Ubuntu Linux, you can
install Beautiful Soup with the system package manager:</p>, <p>I use Python 2.7 and Python 3.2 to develop Beautiful Soup, but it
should work with other recent versions.</p>, <p>Beautiful Soup is packaged as Python 2 code. When you install it for
use with Python 3, it’s automatically converted to Python 3 code. If
you don’t install the package, the code won’t be converted. There have
also been reports on Windows machines of the wrong version being
installed.</p>, <p>In both cases, your best bet is to completely remove the Beautiful
Soup installation from your system (including any directory created
when you unzipped the tarball) and try the installation again.</p>, <p>This table summarizes the advantages and disadvantages of each parser library:</p>, <li>Batteries included</li>, <li>Decent speed</li>,
....
]
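
To try the same idea without a network request, here is a small sketch that runs a similar predicate against an inline document. The HTML string and the names `leaves` and `texts` are made up for illustration. This variant also differs slightly from the code above: it uses `get_text(strip=True)` so that whitespace-only tags are skipped, and `find(True)` to check for any nested tag.

```python
from bs4 import BeautifulSoup

# Hypothetical test document, not taken from the real page.
html = """
<div id="outer">
  <p>Plain paragraph</p>
  <p>Paragraph with <b>bold</b> text</p>
  <ul><li>Item</li><li></li></ul>
  <div>Inner div</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

def end_node(tag):
    # Only consider the tag names we care about.
    if tag.name not in ("div", "p", "li"):
        return False
    # Skip tags with no text (or only whitespace).
    if not tag.get_text(strip=True):
        return False
    # find(True) returns the first nested tag, or None if there is none.
    return tag.find(True) is None

leaves = soup.find_all(end_node)
texts = [tag.get_text() for tag in leaves]
print(texts)  # ['Plain paragraph', 'Item', 'Inner div']
```

The outer `<div>` and the second `<p>` are excluded because they contain other tags, and the empty `<li>` is excluded because it has no text.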