Tags: python, beautifulsoup, python-2.7
Example:
Sometimes the HTML is:
<div id="1">
<div id="2">
this is the text i do NOT want
</div>
this is the text i want here
</div>
Other times it is just:
<div id="1">
this is the text i want here
</div>
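I want to get only the text that sits directly inside one tag and ignore the text of all its child tags. If I use the .text property, I get both.
Updated to use a more generic approach (see the edit history for the original answer):
You can extract the text children of the outer div by testing whether each child is an instance of NavigableString.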
from bs4 import BeautifulSoup, NavigableString
html = '''<div id="1">
<div id="2">
this is the text i do NOT want
</div>
this is the text i want here
</div>'''
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid the bs4 warning
outer = soup.div
inner_text = [element for element in outer if isinstance(element, NavigableString)]
This yields a list of the strings contained directly in the outer div element.
>>> inner_text
[u'\n', u'\n this is the text i want here\n']
>>> ''.join(inner_text)
u'\n\n this is the text i want here\n'
For your second example:
html = '''<div id="1">
this is the text i want here
</div>'''
soup2 = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid the bs4 warning
outer = soup2.div
inner_text = [element for element in outer if isinstance(element, NavigableString)]
>>> inner_text
[u'\n this is the text i want here\n']
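This also works in other cases, such as when the outer div's text comes before any child tag, between child tags, in several separate text nodes, or is absent altogether.
Another possible approach (I put it in a function):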
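As a rough check of those cases, here is a minimal sketch. The helper name direct_text and the sample HTML strings are made up for illustration and are not part of the original answer; whitespace is also collapsed, which the answer's code does not do.

```python
from bs4 import BeautifulSoup, NavigableString

def direct_text(tag):
    """Join the strings that are direct children of tag, skipping child tags.

    Hypothetical helper: same isinstance() test as above, plus whitespace cleanup.
    """
    parts = [child for child in tag.children if isinstance(child, NavigableString)]
    # Collapse runs of whitespace so the result is easy to compare.
    return ' '.join(' '.join(parts).split())

# Illustrative inputs only: text before a child, between children,
# split across several text nodes, and missing entirely.
cases = {
    'before': '<div id="1">wanted<div id="2">unwanted</div></div>',
    'between': '<div id="1"><p>unwanted</p>wanted<p>unwanted</p></div>',
    'multiple': '<div id="1">first<p>unwanted</p>second</div>',
    'none': '<div id="1"><div id="2">unwanted</div></div>',
}

results = {
    name: direct_text(BeautifulSoup(html, 'html.parser').div)
    for name, html in cases.items()
}
print(results)
```

Note that Comment is a subclass of NavigableString in bs4, so HTML comments that are direct children would also be picked up by this test; filter them out separately if that matters.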
def getText(parent):
    return ''.join(parent.find_all(text=True, recursive=False)).strip()
recursive=False means you only want direct children, not nested ones, and text=True means you only want text nodes.
Usage example:
from bs4 import BeautifulSoup
html = """<div id="1">
<div id="2">
this is the text i do NOT want
</div>
this is the text i want here
</div>
"""
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid the bs4 warning
print(getText(soup.div))
#this is the text i want here
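To see why the question's .text attempt returns too much, the sketch below compares it with the direct-children lookup on the same tag. The compact HTML string is an assumption chosen to keep the output short. It uses string=True, the name Beautiful Soup 4.4.0 and later give to the text=True argument used above; on older versions, swap it back to text=True.

```python
from bs4 import BeautifulSoup

# Illustrative markup: a nested div plus text owned by the outer div.
html = '<div id="1"><div id="2">unwanted</div>wanted</div>'
outer = BeautifulSoup(html, 'html.parser').div

# get_text() (what .text returns) walks every descendant, so nested text is included.
all_text = outer.get_text()

# Restricting the search to direct children that are strings skips the nested div.
own_text = ''.join(outer.find_all(string=True, recursive=False)).strip()

print(repr(all_text))
print(repr(own_text))
```

Here all_text contains the text of both divs, while own_text holds only the string that belongs to the outer div itself.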