Get the text of an HTML tag without the text of its inner child tags

Use*_*ser 3 python beautifulsoup python-2.7

Example:

Sometimes the HTML is:

<div id="1">
    <div id="2">
        this is the text i do NOT want
    </div>
    this is the text i want here
</div>

Other times it is just:

<div id="1">
    this is the text i want here
</div>

I want to get only the text that sits directly inside one tag and ignore the text of all its child tags. If I use the .text property, I get both.

mha*_*wke 7

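Updated to use a more general approach (see the edit history for the original answer):

You can extract the text children of the outer div by testing whether each child is an instance of NavigableString.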

from bs4 import BeautifulSoup, NavigableString

html = '''<div id="1">
    <div id="2">
        this is the text i do NOT want
    </div>
    this is the text i want here
</div>'''

soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid the "no parser specified" warning
outer = soup.div
inner_text = [element for element in outer if isinstance(element, NavigableString)]

This yields a list of the strings contained directly in the outer div element.

>>> inner_text
[u'\n    ', u'\n    this is the text i want here\n']
>>> ''.join(inner_text)
u'\n    \n    this is the text i want here\n'

For your second example:

html = '''<div id="1">
    this is the text i want here
</div>'''
soup2 = BeautifulSoup(html, 'html.parser')
outer = soup2.div
inner_text = [element for element in outer if isinstance(element, NavigableString)]

>>> inner_text
[u'\n    this is the text i want here\n']

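This also works in other cases, such as when the outer div's text comes before any child tag, between child tags, in several separate pieces, or is not present at all.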
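
To make those cases concrete, here is a minimal sketch that runs the same NavigableString test over a few small documents. The helper name direct_strings and the sample HTML snippets are made up for illustration and are not part of the original answer; the helper also strips whitespace and drops empty strings, which the answer's list comprehension does not do.

```python
from bs4 import BeautifulSoup, NavigableString

def direct_strings(tag):
    """Return the stripped, non-empty strings that are direct children of tag.

    Hypothetical helper: same isinstance test as above, plus whitespace cleanup.
    """
    return [child.strip() for child in tag.children
            if isinstance(child, NavigableString) and child.strip()]

# Illustrative inputs, one per case mentioned in the answer.
cases = {
    "before": '<div id="1">before<div id="2">inner</div></div>',
    "between": '<div id="1"><p>a</p>between<p>b</p></div>',
    "multiple": '<div id="1">first<p>a</p>second</div>',
    "none": '<div id="1"><p>a</p></div>',
}

results = {name: direct_strings(BeautifulSoup(html, "html.parser").div)
           for name, html in cases.items()}

for name, strings in results.items():
    print(name, strings)
```

In each case only the strings that belong directly to the outer div are returned, and the "none" case gives an empty list rather than an error.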


har*_*r07 7

Another possible approach (I put it in a function):

def getText(parent):
    return ''.join(parent.find_all(text=True, recursive=False)).strip()

recursive=False means you only want direct children, not nested descendants, and text=True means you only want text nodes.

Usage example:

from bs4 import BeautifulSoup

html = """<div id="1">
    <div id="2">
        this is the text i do NOT want
    </div>
    this is the text i want here
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
print(getText(soup.div))
#this is the text i want here
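
To see what recursive=False changes, here is a small side-by-side sketch. It assumes Beautiful Soup 4.4.0 or later, where the text argument is also available under the name string; the sample HTML and variable names are only for illustration.

```python
from bs4 import BeautifulSoup

html = '<div id="1"><div id="2">inner</div>outer</div>'
outer = BeautifulSoup(html, "html.parser").div

# Only text nodes that are direct children of the outer div.
direct = [str(s) for s in outer.find_all(string=True, recursive=False)]

# Default (recursive=True): text nodes at any depth below the outer div.
nested = [str(s) for s in outer.find_all(string=True)]

print(direct)
print(nested)
```

The recursive search also picks up the inner div's text, which is the same result .text would give once joined; the non-recursive search leaves it out.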