alv*_*vas 3 html python nested beautifulsoup
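How do I remove the content of nested tags with BeautifulSoup? These posts show the reverse, retrieving the content of nested tags: "How to get contents of nested tag using BeautifulSoup" and "BeautifulSoup: How do I extract all the <li>s from a list of <ul>s that contains some nested <ul>s?"
I tried .text, but it only strips the tags and keeps the nested text: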
>>> from bs4 import BeautifulSoup as bs
>>> html = "<foo>Something something <bar> blah blah</bar> something else</foo>"
>>> bs(html).find_all('foo')[0]
<foo>Something something <bar> blah blah</bar> something else</foo>
>>> bs(html).find_all('foo')[0].text
u'Something something blah blah something else'
Desired output:
Something something something else
You can check which children are bs4.element.NavigableString instances and keep only those:
from bs4 import BeautifulSoup as bs
import bs4

html = "<foo>Something something <bar> blah blah</bar> something <bar2>GONE!</bar2> else</foo>"

def get_only_text(elem):
    # .children yields direct children only: text nodes and nested tags
    for item in elem.children:
        # keep the text nodes, skip nested tags such as <bar> and <bar2>
        if isinstance(item, bs4.element.NavigableString):
            yield item

print(''.join(get_only_text(bs(html, 'html.parser').find_all('foo')[0])))
Output:
Something something  something  else
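Strings joined this way keep the whitespace that surrounded each removed tag, so the result can contain doubled spaces. As a rough sketch of one way to get the single-spaced text the question asks for, the variant below collects the direct text children with find_all(string=True, recursive=False) and then collapses runs of whitespace. The helper name own_text is made up for illustration, and the string= keyword assumes Beautiful Soup 4.4.0 or later (older releases call it text=).

```python
from bs4 import BeautifulSoup

html = "<foo>Something something <bar> blah blah</bar> something <bar2>GONE!</bar2> else</foo>"

def own_text(elem):
    """Return the text sitting directly inside elem, skipping nested tags.

    own_text is a hypothetical helper name, not part of bs4.
    """
    # string=True matches text nodes; recursive=False limits the search
    # to direct children, so text inside <bar> and <bar2> is left out.
    pieces = elem.find_all(string=True, recursive=False)
    # Collapse the leftover whitespace around the removed tags.
    return " ".join("".join(pieces).split())

foo = BeautifulSoup(html, "html.parser").find("foo")
print(own_text(foo))
```

Like the isinstance check above, this also picks up comments that are direct children, since bs4 treats them as a kind of NavigableString; filter those out separately if the markup may contain them.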