gec*_*kon 6 html python beautifulsoup html-parsing
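I have an HTML document that I need to process, and I am using BeautifulSoup. I would like to retrieve some "subsoups" from that document and join them into one soup, so that I can later pass it as an argument to a function that expects a soup object.
In case that is unclear, here is an example: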
from bs4 import BeautifulSoup
my_document = """
<html>
<body>
<h1>Some Heading</h1>
<div id="first">
<p>A paragraph.</p>
<a href="another_doc.html">A link</a>
<p>A paragraph.</p>
</div>
<div id="second">
<p>A paragraph.</p>
<p>A paragraph.</p>
</div>
<div id="third">
<p>A paragraph.</p>
<a href="another_doc.html">A link</a>
<a href="yet_another_doc.html">A link</a>
</div>
<p id="loner">A paragraph.</p>
</body>
</html>
"""
soup = BeautifulSoup(my_document, "html.parser")
# find the needed parts
first = soup.find("div", {"id": "first"})
third = soup.find("div", {"id": "third"})
loner = soup.find("p", {"id": "loner"})
subsoups = [first, third, loner]
# create a new (sub)soup
resulting_soup = do_some_magic(subsoups)
# use it in a function that expects a soup object and calls its methods
function_expecting_a_soup(resulting_soup)
The goal is for resulting_soup to hold an object that is, or behaves like, a soup with the following content:
<div id="first">
<p>A paragraph.</p>
<a href="another_doc.html">A link</a>
<p>A paragraph.</p>
</div>
<div id="third">
<p>A paragraph.</p>
<a href="another_doc.html">A link</a>
<a href="yet_another_doc.html">A link</a>
</div>
<p id="loner">A paragraph.</p>
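Is there a convenient way to do that? If there is a better way to retrieve the "subsoups" than find(), I can use that instead. Thanks.
Update
The solution suggested by Wondercricket is to concatenate strings containing the found tags and parse them again into a new BeautifulSoup object. That is one possible way to solve the problem, but the re-parsing may take longer than I would like, especially when I want to retrieve most of each document and have many such documents to process. find() returns a bs4.element.Tag. Is there a way to join several Tags into one soup without converting the Tags to strings and parsing those strings?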
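SoupStrainer does exactly what you are asking for and, as a bonus, gives you a performance boost, because it parses only the elements you want rather than the complete document tree: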
from bs4 import BeautifulSoup, SoupStrainer
parse_only = SoupStrainer(id=["first", "third", "loner"])
soup = BeautifulSoup(my_document, "html.parser", parse_only=parse_only)
Now the soup object contains only the desired elements:
<div id="first">
 <p>
  A paragraph.
 </p>
 <a href="another_doc.html">
  A link
 </a>
 <p>
  A paragraph.
 </p>
</div>
<div id="third">
 <p>
  A paragraph.
 </p>
 <a href="another_doc.html">
  A link
 </a>
 <a href="yet_another_doc.html">
  A link
 </a>
</div>
<p id="loner">
 A paragraph.
</p>
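A quick way to check that the strained soup behaves like an ordinary soup is to query it with the usual methods. The sketch below is a self-contained variant of the example above; the shortened `html` string and the variable names are illustrative and not taken from the question, and it assumes `bs4` is installed.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Shortened, illustrative version of the document from the question.
html = """
<html><body>
<h1>Some Heading</h1>
<div id="first"><p>A paragraph.</p><a href="another_doc.html">A link</a></div>
<div id="second"><p>A paragraph.</p></div>
<div id="third"><a href="another_doc.html">A link</a><a href="yet_another_doc.html">A link</a></div>
<p id="loner">A paragraph.</p>
</body></html>
"""

wanted = SoupStrainer(id=["first", "third", "loner"])
strained = BeautifulSoup(html, "html.parser", parse_only=wanted)

# The matching elements become direct children of the soup object.
top_level_ids = [tag.get("id") for tag in strained.find_all(recursive=False)]
print(top_level_ids)

# The usual search methods still work on the result.
links = [a["href"] for a in strained.find_all("a")]
print(links)

# Elements that did not match the strainer were never added to the tree.
print(strained.find(id="second"))
```

Because the filtering happens while parsing, this only applies when the set of wanted elements is known before the document is parsed.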
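Is it possible to specify tags as well as IDs? For example, what if I want to keep all paragraphs with class="someclass" but not divs with the same class?
In that case, you can write a search function that combines multiple criteria and pass it to SoupStrainer: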
from bs4 import BeautifulSoup, SoupStrainer
my_document = """
<html>
<body>
<h1>Some Heading</h1>
<div id="first">
<p>A paragraph.</p>
<a href="another_doc.html">A link</a>
<p>A paragraph.</p>
</div>
<div id="second">
<p>A paragraph.</p>
<p>A paragraph.</p>
</div>
<div id="third">
<p>A paragraph.</p>
<a href="another_doc.html">A link</a>
<a href="yet_another_doc.html">A link</a>
</div>
<p id="loner">A paragraph.</p>
<p class="myclass">test</p>
</body>
</html>
"""
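def search(tag, attrs):
    # keep <p> elements whose class attribute contains "myclass"
    if tag == "p" and "myclass" in attrs.get("class", []):
        return tag
    # keep any element whose id is one of the wanted ids
    if attrs.get("id") in ["first", "third", "loner"]:
        return tag
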
parse_only = SoupStrainer(search)
soup = BeautifulSoup(my_document, "html.parser", parse_only=parse_only)
print(soup.prettify())
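Returning to the update in the question: when the document has already been parsed and the tags have been found with find(), one option that the answer above does not cover is to append copies of those tags to an empty BeautifulSoup object. The following is a minimal sketch, not a benchmarked solution. `join_tags` is a hypothetical helper name, the shortened `html` string is illustrative, and the code relies on the `copy.copy()` support for `Tag` objects that Beautiful Soup documents. Copying still walks each subtree, but it avoids serializing to a string and parsing again.

```python
import copy

from bs4 import BeautifulSoup

# Shortened, illustrative version of the document from the question.
html = """
<html><body>
<div id="first"><p>A paragraph.</p><a href="another_doc.html">A link</a></div>
<div id="second"><p>A paragraph.</p></div>
<div id="third"><a href="another_doc.html">A link</a><a href="yet_another_doc.html">A link</a></div>
<p id="loner">A paragraph.</p>
</body></html>
"""


def join_tags(tags):
    """Hypothetical helper: collect copies of the given tags in a new soup."""
    combined = BeautifulSoup("", "html.parser")  # empty soup to collect into
    for tag in tags:
        # copy.copy() yields a detached copy, so the source tree stays intact.
        combined.append(copy.copy(tag))
    return combined


soup = BeautifulSoup(html, "html.parser")
subsoups = [soup.find(id="first"), soup.find(id="third"), soup.find(id="loner")]
resulting_soup = join_tags(subsoups)

print([tag.get("id") for tag in resulting_soup.find_all(recursive=False)])
print(len(resulting_soup.find_all("a")))
```

Appending the original tags instead of copies would also work, but it moves them out of the original tree, so it only suits cases where the source soup is no longer needed.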