Extracting tag content based on content value using BeautifulSoup

Question

Gop*_*pal 3 python beautifulsoup html-content-extraction

I have an HTML document in the following format:

<p>&nbsp;&nbsp;&nbsp;1. Content of the paragraph <i> in italic </i> but not <b> strong </b> <a href="url">ignore</a>.</p>

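I want to extract the content of the paragraph tag, including the content of the italic and bold tags, but not the content of the anchor tag. If possible, the number at the beginning should also be ignored.

The expected output is: Content of the paragraph in italic but not strong.

What is the best way to do this?

Also, the following code snippet raises TypeError: argument of type 'NoneType' is not iterable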

soup = BSoup(page)
for p in soup.findAll('p'):
    if '&nbsp;&nbsp;&nbsp;' in p.string:
        print p

Thanks for your suggestions.

Answer 1

sou*_*eck 5

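Your code fails because tag.string is only set if the tag has exactly one child node and that child is a NavigableString; otherwise it is None, and the `in` test on None raises the TypeError.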
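
To make the behaviour of .string concrete, here is a minimal sketch. It assumes the newer bs4 package on Python 3 rather than the BeautifulSoup 3 module used in this thread, and the sample markup and variable names are made up for illustration.

```python
# Assumption: bs4 (BeautifulSoup 4) on Python 3, not the BeautifulSoup 3 module.
from bs4 import BeautifulSoup

# Illustrative markup: one paragraph with only text, one with mixed children.
html = "<p>only text</p><p>text <i>and</i> a tag</p>"
soup = BeautifulSoup(html, "html.parser")
single, mixed = soup.find_all("p")

# Exactly one child, and it is a NavigableString: .string returns it.
single_string = single.string

# Several children: .string is None, so `'x' in p.string` raises TypeError.
mixed_string = mixed.string

# get_text() concatenates every descendant string and works in both cases.
mixed_text = mixed.get_text()

print(repr(single_string), repr(mixed_string), repr(mixed_text))
```

Checking `p.string is not None` before the `in` test, or testing against `p.get_text()` instead, avoids the error from the question.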

You can achieve what you want by extracting the a tags:

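# Python 2 with BeautifulSoup 3
from BeautifulSoup import BeautifulSoup

s = """<p>&nbsp;&nbsp;&nbsp;1. Content of the paragraph <i> in italic </i> but not <b> strong </b> <a href="url">ignore</a>.</p>"""
# convertEntities turns &nbsp; into the actual non-breaking space character
soup = BeautifulSoup(s, convertEntities=BeautifulSoup.HTML_ENTITIES)

for p in soup.findAll('p'):
    # Remove every <a> tag (and its text) from the paragraph
    for a in p.findAll('a'):
        a.extract()
    # Join the remaining text nodes, including those inside <i> and <b>
    print ''.join(p.findAll(text=True))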

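For readers on Python 3, the same idea might look roughly like this with the bs4 package. This is a sketch rather than the answerer's code: the helper name `paragraph_text`, the whitespace clean-up, and the regular expression for the leading number are assumptions added to match the expected output in the question.

```python
import re

# Assumption: bs4 (BeautifulSoup 4) on Python 3.
from bs4 import BeautifulSoup

html = ('<p>&nbsp;&nbsp;&nbsp;1. Content of the paragraph <i> in italic </i> '
        'but not <b> strong </b> <a href="url">ignore</a>.</p>')


def paragraph_text(p):
    """Hypothetical helper: text of p without <a> contents or a leading 'N.'."""
    for a in p.find_all("a"):
        a.decompose()  # drop the anchor and its text from the tree
    text = p.get_text()
    # Collapse whitespace runs; in Python 3, \s also matches the
    # non-breaking spaces that &nbsp; is decoded to.
    text = re.sub(r"\s+", " ", text).strip()
    # Assumption: the leading number has the form "1. ".
    text = re.sub(r"^\d+\.\s*", "", text)
    # Optional tidy-up: remove the space left before punctuation.
    return re.sub(r"\s+([.,;:!?])", r"\1", text)


soup = BeautifulSoup(html, "html.parser")
results = [paragraph_text(p) for p in soup.find_all("p")]
print(results)
```

Note that `decompose()` modifies the parsed tree in place, so parse the document again if the links are needed later.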
