How to get the text from a tag while ignoring its child tags

Question

arc*_*123 6 python beautifulsoup python-3.x

I'm using BeautifulSoup. I have an HTML string:

<div><b>ignore this</b>get this</div>


How can I retrieve "get this" while ignoring "ignore this"?

Thanks

Answer 1

dre*_*cat 13

You can get the div's own text without recursing into the text of its children:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<div><b>ignore this</b>get this</div>', 'html.parser')
>>> soup.div.find(text=True, recursive=False)
'get this'


This works regardless of where the text sits relative to the child tags:

>>> soup = BeautifulSoup('<div>get this<b>ignore this</b></div>', 'html.parser')
>>> soup.div.find(text=True, recursive=False)
'get this'

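One case the snippets above don't cover: `find` returns only the first matching text node, so if the div's own text is split around a child tag you would get just the first piece. Here is a minimal sketch that collects every direct text node instead. The helper name `direct_text` is made up for illustration, and it assumes Beautiful Soup 4.4.0 or newer, where the `string` argument is the current spelling of `text`.

```python
from bs4 import BeautifulSoup


def direct_text(tag):
    """Return the text that belongs to `tag` itself, skipping text inside child tags.

    `direct_text` is a hypothetical helper, not part of Beautiful Soup.
    """
    # string=True matches text nodes; recursive=False restricts the search
    # to the tag's direct children, so text inside <b> is never visited.
    pieces = tag.find_all(string=True, recursive=False)
    return "".join(pieces).strip()


# The div's own text is split into two nodes by the <b> child.
html = "<div>get <b>ignore this</b>this</div>"
soup = BeautifulSoup(html, "html.parser")
print(direct_text(soup.div))
```

On a Beautiful Soup release older than 4.4.0, swapping `string=True` for `text=True` should behave the same way.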

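`find_all` returns a list of matches. You need to apply my solution to each matching div. Something like: `' '.join(div.find(text=True, recursive=False) for div in soup.findAll('div', 'sub'))`, which then joins all the text into a single string if you need that. (3 upvotes)
@AustinA The difference is in the HTML. It's there to show how `recursive=False` gives you the `div`'s text and ignores the child elements, wherever they are positioned. (Thanks, I fixed the typo of "s" for "soup" in the second snippet :)) (2 upvotes)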
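For anyone applying the per-div approach from the comment above, here is a rough runnable sketch. The sample HTML and the helper name `own_texts` are invented for illustration; only the `sub` class comes from the comment. It also skips divs that have no direct text, since `find` returns `None` there and a bare `' '.join(...)` would raise a `TypeError`.

```python
from bs4 import BeautifulSoup

# Made-up sample markup; only the "sub" class name comes from the comment.
html = """
<div class="sub"><b>ignore this</b>first</div>
<div class="sub">second<i>ignore this too</i></div>
<div class="sub"><b>no direct text here</b></div>
<div class="other"><b>skipped</b>not selected</div>
"""
soup = BeautifulSoup(html, "html.parser")


def own_texts(soup, css_class):
    """Collect the first direct text node of every div with the given class."""
    texts = []
    for div in soup.find_all("div", class_=css_class):
        # Same call as in the answer, using the newer `string` keyword.
        node = div.find(string=True, recursive=False)
        if node is not None:  # a div with no text of its own yields None
            texts.append(node.strip())
    return texts


texts = own_texts(soup, "sub")
joined = " ".join(texts)
print(joined)
```

Keeping the list around and joining at the end means you can still work with the per-div values if it turns out you don't want one string.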
