Ank*_*kur 5 python beautifulsoup python-3.x
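I am new to Python and BeautifulSoup4.
I am trying to extract only the text content of all "div", "p", and "li" tags, and only from each tag's direct nodes rather than from its child nodes, so I pass the two options text=True and recursive=False.
These are my attempts: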
content = soup.find_all("b", "div", "p", text=True, recursive=False)
and
tags = ["div", "p", "li"]
content = soup.find_all(tags, text=True, recursive=False)
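Neither of these gives me any output. Do you know what I am doing wrong?
Edit: adding more code and the sample document I am testing against. print(content) prints an empty list.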
import requests
from bs4 import BeautifulSoup
url = "https://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-list"
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(response.text, "html.parser")
tags = ["div", "p", "li"]
content = soup.find_all(tags, text=True, recursive=False)
print(content)
From your question and your comments on the previous answer, I think you are trying to find tags that:
- are the innermost tags
- are 'p', 'li', or 'div'
- contain some text
import requests
from bs4 import BeautifulSoup
from bs4 import NavigableString

url = "https://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-list"
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})

soup = BeautifulSoup(response.text, "html.parser")
def end_node(tag):
    if tag.name not in ["div", "p", "li"]:
        return False
    if isinstance(tag, NavigableString):  # if str return
        return False
    if not tag.text:  # if no text return false
        return False
    elif len(tag.find_all(text=False)) > 0:  # no other tags inside other than text
        return False
    return True  # if valid it reaches here
content = soup.find_all(end_node)
print(content)  # all end nodes matching our criteria

Sample output:
[<p>These instructions illustrate all major features of Beautiful Soup 4,
with examples. I show you what the library is good for, how it works,
how to use it, how to make it do what you want, and what to do when it
violates your expectations.</p>, <p>The examples in this documentation should work the same way in Python
2.7 and Python 3.2.</p>, <p>This documentation has been translated into other languages by
Beautiful Soup users:</p>, <p>Here are some simple ways to navigate that data structure:</p>, <p>One common task is extracting all the URLs found within a page’s <a> tags:</p>, <p>Another common task is extracting all the text from a page:</p>, <p>Does this look like what you need? If so, read on.</p>, <p>If you’re using a recent version of Debian or Ubuntu Linux, you can
install Beautiful Soup with the system package manager:</p>, <p>I use Python 2.7 and Python 3.2 to develop Beautiful Soup, but it
should work with other recent versions.</p>, <p>Beautiful Soup is packaged as Python 2 code. When you install it for
use with Python 3, it’s automatically converted to Python 3 code. If
you don’t install the package, the code won’t be converted. There have
also been reports on Windows machines of the wrong version being
installed.</p>, <p>In both cases, your best bet is to completely remove the Beautiful
Soup installation from your system (including any directory created
when you unzipped the tarball) and try the installation again.</p>, <p>This table summarizes the advantages and disadvantages of each parser library:</p>, <li>Batteries included</li>, <li>Decent speed</li>,
....
]
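For anyone who wants to try this without fetching the live page, here is a minimal offline sketch. The inline HTML and the helper names is_leaf_with_text and direct_text are made up for illustration and are not part of the question or of Beautiful Soup. It shows three things: with recursive=False called on the soup object itself, only the document's top-level children are examined, which is the likely reason the original call came back empty; a leaf-tag filter in the spirit of end_node above (it uses tag.find(True) to detect child tags and ignores whitespace-only text, which is slightly stricter than the original); and one way to collect only the text that sits directly inside a tag, which is what the question seems to ask for.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical sample document, used instead of the live page.
HTML = """
<html><body>
<div>outer text<p>first paragraph</p><ul><li>item one</li><li>item <b>two</b></li></ul></div>
<p></p>
</body></html>
"""

TAGS = ["div", "p", "li"]
soup = BeautifulSoup(HTML, "html.parser")

# recursive=False on the soup object only looks at its direct children
# (here: the <html> tag and surrounding whitespace), so nothing matches.
top_level = soup.find_all(TAGS, recursive=False)


def is_leaf_with_text(tag):
    """True for div/p/li tags that contain no child tags and some visible text."""
    return (
        tag.name in TAGS
        and tag.find(True) is None  # find(True) matches any descendant tag
        and bool(tag.get_text(strip=True))
    )


def direct_text(tag):
    """Join only the strings that are direct children of tag."""
    parts = [child for child in tag.children if isinstance(child, NavigableString)]
    return "".join(parts).strip()


leaves = soup.find_all(is_leaf_with_text)

# (tag name, direct text) for every div/p/li that has some direct text.
direct = [
    (tag.name, direct_text(tag))
    for tag in soup.find_all(TAGS)
    if direct_text(tag)
]

print(top_level)
print(leaves)
print(direct)
```

The leaf filter skips the second li because it contains a b tag, whereas direct_text still reports the "item" part of it. Which of the two behaviours is wanted depends on how "direct nodes only" is meant.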