How to find all comments with Beautiful Soup

Jos*_*eph 11 html python comments beautifulsoup bs4

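This question was asked four years ago, but the answers are now outdated for BS4.

I want to remove all comments from my HTML file using Beautiful Soup. Since BS4 treats each comment as a special type of NavigableString, I thought this code would work: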

for comments in soup.find_all('comment'):
     comments.decompose()

But this doesn't work. How do I find all comments with BS4?

Fli*_*ght 13

You can pass a function to find_all() so that it checks whether each string is a Comment.

For example, given the following HTML:

<body>
   <!-- Branding and main navigation -->
   <div class="Branding">The Science &amp; Safety Behind Your Favorite Products</div>
   <div class="l-branding">
      <p>Just a brand</p>
   </div>
   <!-- test comment here -->
   <div class="block_content">
      <a href="https://www.google.com">Google</a>
   </div>
</body>

Code:

from bs4 import BeautifulSoup as BS
from bs4 import Comment
# ... (html holds the markup shown above)
soup = BS(html, 'html.parser')
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for c in comments:
    print(c)
    print("===========")
    c.extract()

The output will be:

 Branding and main navigation 
===========
 test comment here 
===========

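By the way, I think the reason find_all('Comment') doesn't work is this (from the Beautiful Soup documentation):

Pass in a value for name and you'll tell Beautiful Soup to only consider tags with certain names. Text strings will be ignored, as will tags whose names don't match.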
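
To make the difference concrete, here is a minimal sketch. The one-line markup and the variable names by_name and by_type are made up for illustration. It contrasts a tag-name lookup with a string filter on the same document:

```python
from bs4 import BeautifulSoup, Comment

# Illustrative markup (not from the question): one comment, one tag with text.
html = "<div><!-- a comment --><p>text</p></div>"
soup = BeautifulSoup(html, "html.parser")

# A bare string argument is matched against tag names only, so this
# searches for <comment> elements and finds none.
by_name = soup.find_all("comment")

# A string= filter is applied to text nodes; Comment is a NavigableString
# subclass, so isinstance() picks out the comments.
by_type = soup.find_all(string=lambda s: isinstance(s, Comment))

print(len(by_name))  # number of <comment> tags found
print(len(by_type))  # number of comment nodes found
```

The first lookup should come back empty and the second should return the single comment node, which is why the loop in the question never runs.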


Jos*_*eph 10

There were two things I needed to do:

First, when importing Beautiful Soup:

from bs4 import BeautifulSoup, Comment

Second, here is the code to extract the comments:

# find_all(string=...) is the current BS4 spelling of findAll(text=...)
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    comment.extract()
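
Putting the two steps together, a self-contained sketch might look like the following. The helper name strip_comments and the sample markup are hypothetical, chosen only for this example. It parses a string, extracts every comment node, and returns the remaining markup:

```python
from bs4 import BeautifulSoup, Comment


def strip_comments(markup: str) -> str:
    """Return markup with all HTML comments removed (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    # find_all() returns a list, so extracting nodes while looping is safe.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    return str(soup)


# Illustrative input: one comment inside a tag, one at the top level.
sample = "<p>keep<!-- drop me --></p><!-- and me -->"
cleaned = strip_comments(sample)
print(cleaned)
```

With this input, only the paragraph and its text should remain in the result.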