Extracting text between HTML comments using BeautifulSoup

Question

LAN*_*ark 4 python beautifulsoup web-scraping python-3.x

Using Python 3 and BeautifulSoup 4, I would like to extract from an HTML page only the text that is delimited by the comment directly above it. An example:

<!--UNIQUE COMMENT-->
I would like to get this text
<!--SECOND UNIQUE COMMENT-->
I would also like to find this text

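I have found several ways to extract a page's text or its comments, but nothing that does what I am looking for. Any help would be greatly appreciated.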

Answer 1

Mar*_*ans 5

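You just need to iterate over all of the comments, check whether each one is among the required entries, and then display the text of the element that follows it, like this: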

from bs4 import BeautifulSoup, Comment

html = """
<html>
<body>
<p>p tag text</p>
<!--UNIQUE COMMENT-->
I would like to get this text
<!--SECOND UNIQUE COMMENT-->
I would also like to find this text
</body>
</html>
"""
soup = BeautifulSoup(html, 'lxml')

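# Select only the comment nodes in the parsed tree
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    # Comment is a str subclass, so it compares equal to its own text
    if comment in ['UNIQUE COMMENT', 'SECOND UNIQUE COMMENT']:
        # next_element is the text node directly after the comment
        print(comment.next_element.strip())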

This would display the following:

I would like to get this text
I would also like to find this text
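
If the text after a comment is interrupted by inline tags, `next_element` only returns the first text node. A minimal sketch of one way to handle that, assuming the wanted text and the comments are siblings under the same parent: walk `next_siblings` until the next comment. The helper name `text_after_comments`, the `SAMPLE_HTML` string with its added `<b>` tag, and the use of the stdlib `html.parser` backend are illustrative choices here, not part of the original answer.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample: same layout as the question, plus an inline <b> tag
SAMPLE_HTML = """
<html>
<body>
<p>p tag text</p>
<!--UNIQUE COMMENT-->
I would like to <b>get</b> this text
<!--SECOND UNIQUE COMMENT-->
I would also like to find this text
</body>
</html>
"""


def text_after_comments(html, wanted):
    """Map each wanted comment to the text between it and the next comment."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        label = comment.strip()
        if label not in wanted:
            continue
        parts = []
        for sibling in comment.next_siblings:
            if isinstance(sibling, Comment):
                break  # the next comment ends this section
            # Text nodes are str subclasses; tags expose get_text()
            parts.append(sibling if isinstance(sibling, str) else sibling.get_text())
        # Collapse runs of whitespace and newlines into single spaces
        result[label] = " ".join("".join(parts).split())
    return result


found = text_after_comments(SAMPLE_HTML, {"UNIQUE COMMENT", "SECOND UNIQUE COMMENT"})
for label, text in found.items():
    print(label, "->", text)
```

Returning a dict keyed by the comment text keeps each extracted section tied to the comment that introduced it. Text nested at a different level from its comment would need a different traversal.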

Archived:	10 years ago
Views:	4601
Last activity:	10 years ago