How do I ignore tags when getting the .string of a Beautiful Soup element?

Question

use*_*016 3 python dom beautifulsoup html-parsing

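I am working with HTML elements that have child tags, and I want to "ignore" or strip those tags so that only the text remains. Right now, if I call .string on any element that contains a tag, all I get is None.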

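import bs4

# Name the parser explicitly so the result does not depend on what is installed.
soup = bs4.BeautifulSoup("""
    <div id="main">
      <p>This is a paragraph.</p>
      <p>This is a paragraph <span class="test">with a tag</span>.</p>
      <p>This is another paragraph.</p>
    </div>
""", "html.parser")

main = soup.find(id='main')
for child in main.children:
    print(child.string)  # None for the <p> that contains a <span>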

Output (whitespace-only lines omitted):

This is a paragraph.
None
This is another paragraph.

I want the second line to be "This is a paragraph with a tag." How can I do that?

Answer 1

ale*_*cxe 5

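# Skip the whitespace text nodes between the <p> tags and keep only real tags.
for child in soup.find(id='main'):
    if isinstance(child, bs4.Tag):
        # .text joins every string inside the tag, including nested tags.
        print(child.text)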

This gives you:

This is a paragraph.
This is a paragraph with a tag.
This is another paragraph.
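
For anyone wondering why .string fails here: it only returns a value when a tag has exactly one child (or a chain of single children ending in one string). The second paragraph has three children (text, a span, text), so .string is None, while .text and get_text() concatenate every nested string. Below is a minimal, self-contained sketch of the same idea for Python 3. The helper name paragraph_texts is made up for illustration and is not part of bs4, and the choice of "html.parser" is an assumption to keep the example free of extra dependencies.

```python
import bs4

html = """
<div id="main">
  <p>This is a paragraph.</p>
  <p>This is a paragraph <span class="test">with a tag</span>.</p>
  <p>This is another paragraph.</p>
</div>
"""

# "html.parser" ships with Python; naming it avoids depending on
# whichever third-party parser happens to be installed.
soup = bs4.BeautifulSoup(html, "html.parser")


def paragraph_texts(root):
    """Return the full text of every <p> under root, nested tags included."""
    # paragraph_texts is a hypothetical helper name, not part of bs4.
    return [p.get_text() for p in root.find_all("p")]


main = soup.find(id="main")
second = main.find_all("p")[1]

# The second <p> has three children (text, <span>, text), so there is
# no single string for .string to return.
print(second.string)

for text in paragraph_texts(main):
    print(text)
```

Selecting the paragraphs with find_all("p") sidesteps the whitespace-only text nodes between tags, so no isinstance check is needed. If stray whitespace inside the paragraphs is a concern, get_text(strip=True) or " ".join(p.stripped_strings) may be worth trying instead.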

Run Code Online (Sandbox Code Playgroud)

Archived: 12 years, 4 months ago
Views: 3538
Last activity: 12 years, 4 months ago