Using BeautifulSoup to find an HTML tag that contains certain text

Question

sot*_*ips 61 python regex beautifulsoup html-content-extraction

I'm trying to get the elements in an HTML document that contain the following text pattern: #\S{11}

<h2> this is cool #12345678901 </h2>

So the element above would be matched using:

soup('h2',text=re.compile(r' #\S{11}'))

And the results would be something like:

[u'blahblah #223409823523', u'thisisinteresting #293845023984']

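I'm able to get all the text that matches (see the line above). But I want the parent element of the text to match, so that I can use it as a starting point for traversing the document tree. In this case, I'd want all the h2 elements returned, not the text matches.

Ideas?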

Answer 1

nos*_*klo 71

# BeautifulSoup 3 on Python 2, as originally posted
from BeautifulSoup import BeautifulSoup
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)

# text= returns the matching text nodes; .parent is the tag that contains each one
for elem in soup(text=re.compile(r' #\S{11}')):
    print elem.parent

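This prints (the h1 is included because no tag name is given and its text also matches the pattern):

<h2>this is cool #12345678901</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>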
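
For readers on Python 3 and bs4, the same idea can be sketched roughly as follows. This is not part of the original answer: it assumes bs4 4.4.0 or later (where the `text=` argument is also available as `string=`) and the built-in "html.parser" backend, and the names `parents` and `h2_parents` are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r" #\S{11}")

# With no tag name given, find_all returns the matching text nodes
# (NavigableString objects); .parent is the tag that holds each one.
parents = [node.parent for node in soup.find_all(string=pattern)]

# The pattern also matches the text inside the <h1>, so keep only <h2>.
h2_parents = [tag for tag in parents if tag.name == "h2"]

for tag in h2_parents:
    print(tag)
```

Filtering on `tag.name` afterwards is one way to restrict the result to h2 elements; any of the returned tags can then serve as a starting point for walking the tree.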

`.parent` is great! I never thought of that. Thanks @nosklo. +1 (2 upvotes)

Answer 2

Bru*_*sky 19

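When `text=` is used as a criterion, BeautifulSoup search operations return [a list of] `BeautifulSoup.NavigableString` objects, rather than `BeautifulSoup.Tag` objects as in other cases. Check the object's `__dict__` to see the attributes available to you. Of these attributes, `parent` is favored over `previous` because of changes in BS4.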

from BeautifulSoup import BeautifulSoup
from pprint import pprint
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)

# Even though the OP was not looking for 'cool', it's more understandable to work with item zero.
pattern = re.compile(r'cool')

pprint(soup.find(text=pattern).__dict__)
#>> {'next': u'\n',
#>>  'nextSibling': None,
#>>  'parent': <h2>this is cool #12345678901</h2>,
#>>  'previous': <h2>this is cool #12345678901</h2>,
#>>  'previousSibling': None}

print soup.find('h2')
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern)
#>> this is cool #12345678901
print soup.find('h2', text=pattern).parent
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern) == soup.find('h2')
#>> False
print soup.find('h2', text=pattern) == soup.find('h2').text
#>> True
print soup.find('h2', text=pattern).parent == soup.find('h2')
#>> True
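
The same distinction can be checked under bs4 on Python 3. The sketch below is an illustration rather than part of the original answer; it assumes bs4 4.4.0 or later (for the `string=` argument) and the built-in "html.parser" backend, and uses a shortened document.

```python
import re
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html_text = "<h2>this is cool #12345678901</h2><h2>this is nothing</h2>"
soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r"cool")

# Searching by text alone yields the text node itself, not the tag.
text_node = soup.find(string=pattern)
print(isinstance(text_node, NavigableString))  # True

# The node's parent is the enclosing tag, which is the same object
# that a search by tag name returns.
print(isinstance(text_node.parent, Tag))       # True
print(text_node.parent is soup.find("h2"))     # True
```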

Answer 3

T.C*_*tor 5

With bs4 (Beautiful Soup 4), the OP's attempt works exactly as expected:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h2> this is cool #12345678901 </h2>", "html.parser")
soup('h2', text=re.compile(r' #\S{11}'))

returns [<h2> this is cool #12345678901 </h2>].
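
One caveat worth sketching, as an addition rather than something the answer states: when a tag name is combined with a text filter, bs4 compares the pattern against the tag's `.string`, which is None once the tag has more than one child. The example assumes bs4 4.4.0 or later (`string=` is the newer name for `text=`) and "html.parser"; the helper `h2_with_pattern` and the second h2 in the sample markup are hypothetical.

```python
import re
from bs4 import BeautifulSoup

html_text = """
<h2> this is cool #12345678901 </h2>
<h2>this is <b>bold</b> #12345678901</h2>
<h2>this is nothing</h2>
"""
soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r" #\S{11}")

# Tag name plus string filter: the <h2> that contains a nested <b>
# has .string == None, so it is not returned here.
simple = soup.find_all("h2", string=pattern)

# Hypothetical helper: match against the tag's concatenated text instead.
def h2_with_pattern(tag):
    return tag.name == "h2" and pattern.search(tag.get_text()) is not None

nested_ok = soup.find_all(h2_with_pattern)

print(len(simple), len(nested_ok))
```

Passing a function to `find_all` is slower than a plain name filter on large documents, but it covers headings whose text is split across child tags.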

Asked:	16 years, 8 months ago
Viewed:	72295 times
Last active:	8 years ago