BeautifulSoup: exclude the content inside a specific tag

Question

Dav*_*542 4 html python lxml beautifulsoup html-parsing

I have the following code to get the text of a paragraph:

soup.find("td", { "id" : "overview-top" }).find("p", { "itemprop" : "description" }).text


How can I exclude all the text inside the <a> tags? Something like "in <p> but not in <a>"?

Answer 1

ale*_*cxe 5

Find all the text nodes inside the p tag, keep only those whose parent is not an a tag, and join them:

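p = soup.find("td", {"id": "overview-top"}).find("p", {"itemprop": "description"})

# Keep only the text nodes whose direct parent is not an <a> tag.
# string=True is the bs4 4.4+ name for the older text=True argument.
print(''.join(text for text in p.find_all(string=True)
              if text.parent.name != "a"))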


Demo (note that "link text" is not printed):

>>> from bs4 import BeautifulSoup
>>> 
>>> data = """
... <td id="overview-top">
...     <p itemprop="description">
...         text1
...         <a href="google.com">link text</a>
...         text2
...     </p>
... </td>
... """
>>> soup = BeautifulSoup(data, "html.parser")
>>> p = soup.find("td", {"id": "overview-top"}).find("p", {"itemprop": "description"})
>>> print(p.text)

        text1
        link text
        text2
>>>
>>> print(''.join(text for text in p.find_all(string=True) if text.parent.name != "a"))

        text1

        text2

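One caveat with the parent check: it only looks at the direct parent, so text wrapped in another tag inside the link (for example <a><b>...</b></a>) would still get through. The sketch below is one possible variant, not part of the original answer. The helper name text_outside and the nested <b> in the sample markup are made up for illustration. It uses find_parent() so that any <a> ancestor excludes the text node, and it assumes bs4 4.4 or later for string=True. Note that find_parent() walks all the way up the tree, so if the paragraph itself sits inside an <a>, everything is excluded.

```python
from bs4 import BeautifulSoup


def text_outside(element, excluded="a"):
    """Join the text nodes of `element` that have no `excluded` ancestor.

    Hypothetical helper for illustration; not a Beautiful Soup API.
    """
    return "".join(
        node
        for node in element.find_all(string=True)
        # find_parent() checks every ancestor, not just the direct parent
        if node.find_parent(excluded) is None
    )


html = """
<td id="overview-top">
    <p itemprop="description">
        text1
        <a href="google.com"><b>nested</b> link text</a>
        text2
    </p>
</td>
"""

soup = BeautifulSoup(html, "html.parser")
p = soup.find("td", {"id": "overview-top"}).find("p", {"itemprop": "description"})

# Collapse runs of whitespace left behind by the removed link
result = " ".join(text_outside(p).split())
print(result)  # text1 text2
```

The whitespace collapsing at the end is optional; drop it if the original spacing and line breaks matter.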

Archived: 11 years ago
Views: 4,724
Last activity: 11 years ago