How to find a tag with specific text using Beautiful Soup?

Question

LA_*_*LA_ 28 html python beautifulsoup web-scraping

I have the following HTML (line breaks are marked with \n):

...
<tr>
  <td class="pos">\n
      "Some text:"\n
      <br>\n
      <strong>some value</strong>\n
  </td>
</tr>
<tr>
  <td class="pos">\n
      "Fixed text:"\n
      <br>\n
      <strong>text I am looking for</strong>\n
  </td>
</tr>
<tr>
  <td class="pos">\n
      "Some other text:"\n
      <br>\n
      <strong>some other value</strong>\n
  </td>
</tr>
...

How can I find "text I am looking for"? The code below returns the first value found, so I need to filter on "Fixed text:" somehow.

result = soup.find('td', {'class' :'pos'}).find('strong').text

Update. If I use the following code:

title = soup.find('td', text = re.compile(ur'Fixed text:(.*)', re.DOTALL), attrs = {'class': 'pos'})
self.response.out.write(str(title.string).decode('utf8'))

then it returns just "Fixed text:".

Answer 1

小智 30

You can pass a regular expression to the text parameter of findAll, like so:

import BeautifulSoup
import re

columns = soup.findAll('td', text = re.compile('your regex here'), attrs = {'class' : 'pos'})
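
The snippet above targets BeautifulSoup 3. As a minimal sketch for bs4 (4.4.0 or later, where the argument is named string), one option is to search for the label string by regex and then walk from the matched string to its enclosing cell. The sample HTML, the pattern and the helper name values_after_label are assumptions made for illustration, not part of the original answer.

```python
import re
from bs4 import BeautifulSoup

# Simplified copy of the HTML from the question (assumption: no quotes or \n markers).
HTML = """
<table>
<tr><td class="pos">Some text:<br><strong>some value</strong></td></tr>
<tr><td class="pos">Fixed text:<br><strong>text I am looking for</strong></td></tr>
<tr><td class="pos">Some other text:<br><strong>some other value</strong></td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
pattern = re.compile(r"Fixed text")


def values_after_label(soup, pattern):
    """Hypothetical helper: return the <strong> text of every td.pos whose label matches."""
    results = []
    # With only string= given, find_all returns the matching text nodes, not tags.
    for label in soup.find_all(string=pattern):
        cell = label.find_parent("td", class_="pos")
        if cell is None:
            continue
        strong = cell.find("strong")
        if strong is not None:
            results.append(strong.get_text(strip=True))
    return results


print(values_after_label(soup, pattern))
```

Searching for the string alone and then calling find_parent avoids relying on how a combined tag-plus-text search behaves, which differs between BeautifulSoup 3 and bs4.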

Answer 2

Bru*_*sky 23

This post got me to my answer, even though the answer itself is missing from this post. I felt I should give back.

The challenge here is the inconsistent behavior of BeautifulSoup.find when searching with and without text.

Note: if you have BeautifulSoup installed, you can test this locally via:

curl https://gist.githubusercontent.com/RichardBronosky/4060082/raw/test.py | python

Code: https://gist.github.com/4060082

# Taken from https://gist.github.com/4060082
from BeautifulSoup import BeautifulSoup
from urllib2 import urlopen
from pprint import pprint
import re

soup = BeautifulSoup(urlopen('https://gist.githubusercontent.com/RichardBronosky/4060082/raw/test.html').read())
# I'm going to assume that Peter knew that re.compile is meant to cache a computation result for a performance benefit. However, I'm going to do that explicitly here to be very clear.
pattern = re.compile('Fixed text')

# Peter's suggestion here returns a list of what appear to be strings
columns = soup.findAll('td', text=pattern, attrs={'class' : 'pos'})
# ...but it is actually a BeautifulSoup.NavigableString
print type(columns[0])
#>> <class 'BeautifulSoup.NavigableString'>

# you can reach the tag using one of the convenience attributes seen here
pprint(columns[0].__dict__)
#>> {'next': <br />,
#>>  'nextSibling': <br />,
#>>  'parent': <td class="pos">\n
#>>       "Fixed text:"\n
#>>       <br />\n
#>>       <strong>text I am looking for</strong>\n
#>>   </td>,
#>>  'previous': <td class="pos">\n
#>>       "Fixed text:"\n
#>>       <br />\n
#>>       <strong>text I am looking for</strong>\n
#>>   </td>,
#>>  'previousSibling': None}

# I feel that 'parent' is safer to use than 'previous' based on http://www.crummy.com/software/BeautifulSoup/bs4/doc/#method-names
# So, if you want to find the 'text' in the 'strong' element...
pprint([t.parent.find('strong').text for t in soup.findAll('td', text=pattern, attrs={'class' : 'pos'})])
#>> [u'text I am looking for']

# Here is what we have learned:
print soup.find('strong')
#>> <strong>some value</strong>
print soup.find('strong', text='some value')
#>> u'some value'
print soup.find('strong', text='some value').parent
#>> <strong>some value</strong>
print soup.find('strong', text='some value') == soup.find('strong')
#>> False
print soup.find('strong', text='some value') == soup.find('strong').text
#>> True
print soup.find('strong', text='some value').parent == soup.find('strong')
#>> True

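Though this is most definitely too late to help the OP, I hope they will mark this as the answer, since it resolves all the quandaries around finding by text.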
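
The observations in this answer were made with BeautifulSoup 3 on Python 2. As a hedged sketch of how the same checks look in bs4 on Python 3 (assuming bs4 4.4.0 or later for the string argument): a search that gives both a tag name and a string returns the tag, while a search by string alone returns the text node. The one-cell HTML fragment below is an assumption for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Minimal fragment modeled on the question's markup (assumption for illustration).
soup = BeautifulSoup(
    "<td class='pos'>Some text:<br><strong>some value</strong></td>",
    "html.parser",
)

by_name = soup.find("strong")
by_name_and_string = soup.find("strong", string="some value")
by_string_only = soup.find(string="some value")

# Tag name plus string: bs4 returns the tag whose .string matches.
print(isinstance(by_name_and_string, Tag))

# String only: bs4 returns the text node itself.
print(isinstance(by_string_only, NavigableString))

# The text node's parent is the same <strong> element found by name.
print(by_string_only.parent is by_name)
```

In other words, the .parent step shown above is still needed in bs4 only when searching by string without a tag name.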

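@BrunoBronosky, it has been 5 years and you still come back to this documentation you wrote. Thank you for taking the time to write this. Much appreciated. Yourself. (14 upvotes)
@BrunoBronosky I know, but thanks for saying so. (7 upvotes)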

Answer 3

QHa*_*arr 10

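In bs4 4.7.1+ you can use the :contains pseudo-class to select the td that contains your search string (a filter). You can then use a descendant combinator to move to the strong element that holds the target text: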

from bs4 import BeautifulSoup as bs

html = '''
<tr>
  <td class="pos">\n
      "Some text:"\n
      <br>\n
      <strong>some value</strong>\n
  </td>
</tr>
<tr>
  <td class="pos">\n
      "Fixed text:"\n
      <br>\n
      <strong>text I am looking for</strong>\n
  </td>
</tr>
<tr>
  <td class="pos">\n
      "Some other text:"\n
      <br>\n
      <strong>some other value</strong>\n
  </td>
</tr>'''
soup = bs(html, 'lxml')
print(soup.select_one('td:contains("Fixed text:") strong').text)

As of Soup Sieve 2.1.0:

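NEW: In order to avoid conflicts with future CSS specification changes, non-standard pseudo-classes will now start with the :-soup- prefix. As a consequence, :contains() will now be known as :-soup-contains(), though for a time the deprecated form :contains() will still be allowed, with a warning that users should migrate to :-soup-contains().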

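NEW: Added a new non-standard pseudo-class :-soup-contains-own(), which operates like :-soup-contains() except that it only looks at text nodes directly associated with the currently scoped element and not its descendants.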

Quoted from @facelessuser's GitHub page.
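
A minimal sketch of the two prefixed pseudo-classes described in the quote, assuming Soup Sieve 2.1.0 or later is installed alongside bs4. The sample HTML is a simplified copy of the question's markup, and html.parser is used here only so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Simplified copy of the HTML from the question (assumption for illustration).
HTML = """<table>
<tr><td class="pos">Some text:<br><strong>some value</strong></td></tr>
<tr><td class="pos">Fixed text:<br><strong>text I am looking for</strong></td></tr>
</table>"""

soup = BeautifulSoup(HTML, "html.parser")

# Same selector as in the answer, with the non-deprecated pseudo-class name.
strong = soup.select_one('td.pos:-soup-contains("Fixed text:") strong')
print(strong.get_text())

# :-soup-contains() also considers descendant text, so ancestors match too.
all_hits = soup.select(':-soup-contains("text I am looking for")')
print([tag.name for tag in all_hits])

# :-soup-contains-own() only considers text nodes directly inside the element.
own_hits = soup.select(':-soup-contains-own("text I am looking for")')
print([tag.name for tag in own_hits])
```

The -own variant is the narrower filter: it is useful when only the element that directly holds the text should match, rather than every ancestor of it.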

Answer 4

Mem*_*min 5

Since Beautiful Soup 4.4.0, a parameter called string does the work that text did in previous versions.

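string is for finding strings, and you can combine it with arguments that find tags: Beautiful Soup will find all tags whose .string matches your value for string. This code finds the tags whose .string is "Elsie":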

soup.find_all("td", string="Elsie")

For more information about string, see this section: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-string-argument
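
One caveat when applying this to the HTML in the question: a tag's .string is None when the tag has more than one child, so combining "td" with string finds nothing for a cell that holds a label, a br and a strong. The sketch below illustrates that and shows one possible workaround, a function filter that does not depend on the tag name. The helper has_own_text is a hypothetical name introduced here, not a Beautiful Soup API.

```python
from bs4 import BeautifulSoup

# One cell modeled on the question's markup (assumption for illustration).
HTML = (
    '<table><tr><td class="pos">Fixed text:<br>'
    "<strong>text I am looking for</strong></td></tr></table>"
)
soup = BeautifulSoup(HTML, "html.parser")

# The td has three children (text, <br>, <strong>), so .string is None ...
td = soup.find("td")
print(td.string)

# ... and therefore a combined tag + string search returns no tags.
no_match = soup.find_all("td", string="Fixed text:")
print(no_match)


def has_own_text(needle):
    """Hypothetical helper: build a filter matching any tag with a direct text child containing needle."""
    def matcher(tag):
        # NavigableString subclasses str; child tags do not.
        return any(isinstance(child, str) and needle in child for child in tag.children)
    return matcher


# Works without naming the tag up front.
hits = soup.find_all(has_own_text("Fixed text:"))
print([hit.name for hit in hits])
print(hits[0].find("strong").get_text())
```

Because the filter inspects only direct text children, it matches the td itself rather than its tr or table ancestors.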

What if the tag changes, so that you cannot name it explicitly the way "td" is named in this example? What can I do in that case? (2 upvotes)

Asked: 13 years, 11 months ago
Viewed: 76,994 times
Last activity: 6 years, 1 month ago