BeautifulSoup: multiple tags, each with a specific class

Question


Pag*_*Max 9 html python tags beautifulsoup findall

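I am trying to use BeautifulSoup to parse a table on a website. (I cannot share the site's source code because its use is restricted.)

I want to extract data only when it has the following two tags, each with its specific class: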

td, width=40%
tr, valign=top


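The reason I am doing this is to extract only the data that has these tags and classes.

I found some discussion here about using multiple tags, but it covers only tags, not classes. I did try extending the code with the same list-based logic, but I don't think it gives me what I want: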

 my_soup=soup.find_all(['td',{"width":"40%"},'tr',{'valign':'top'}])


To summarize, my question is how to use multiple tags, each with a specific class, in find_all so that the result is the "and" of both tags.

Answer 1

Aja*_*234 5

You can pass a re.compile object to soup.find_all:

import re
from bs4 import BeautifulSoup as soup
html = """
  <table>
    <tr style='width:40%'>
      <td style='align:top'></td>
    </tr>
  </table>
"""
results = soup(html, 'html.parser').find_all(re.compile('td|tr'), {'style':re.compile('width:40%|align:top')})

Output:

[<tr style="width:40%">
   <td style="align:top"></td>
 </tr>, <td style="align:top"></td>]

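By supplying re.compile objects for the desired tag names and for the style value, find_all returns every tr or td whose inline style attribute contains either width:40% or align:top.

The same regex-based call can be extended to find elements by supplying multiple attribute values: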

html = """
 <table>
   <tr style='width:40%'>
    <td style='align:top' class='get_this'></td>
    <td style='align:top' class='ignore_this'></td>
  </tr>
</table>
"""
results = soup(html, 'html.parser').find_all(re.compile('td|tr'), {'style':re.compile('width:40%|align:top'), 'class':'get_this'})

Output:

[<td class="get_this" style="align:top"></td>]
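
Note that the regular-expression filters above match the tag name and the attribute value independently, so any tr or td carrying either style value is returned. If each tag should be tied to its own attribute, one option is to pass a function to find_all. The following is a minimal sketch under assumptions: it uses the valign and width attributes named in the question, and the sample markup and the helper name is_wanted are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup using the attributes named in the question.
html = """
<table>
  <tr valign="top"><td width="40%">a</td><td width="60%">b</td></tr>
  <tr valign="bottom"><td width="40%">c</td></tr>
</table>
"""

def is_wanted(tag):
    # Pair each tag name with its own attribute requirement.
    if tag.name == "tr":
        return tag.get("valign") == "top"
    if tag.name == "td":
        return tag.get("width") == "40%"
    return False

doc = BeautifulSoup(html, "html.parser")
results = doc.find_all(is_wanted)
names = [tag.name for tag in results]
print(names)
```

Here find_all calls is_wanted once per tag and keeps the tags for which it returns True, so a tr is checked only against valign and a td only against width.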

Edit 2: a simple recursive solution:

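import bs4
from bs4 import BeautifulSoup as soup

def get_tags(d, params):
  # params maps a tag name to the attributes wanted on that tag
  wanted = params.get(d.name, {})
  # 'class' is parsed into a list, so test membership; other attributes are compared directly.
  # A single matching attribute is enough for the tag to be yielded.
  if any(b in d.attrs.get(a, []) if a == 'class' else b == d.attrs.get(a) for a, b in wanted.items()):
     yield d
  # Recurse into child tags only, skipping text nodes
  for i in d.contents:
     if isinstance(i, bs4.element.Tag):
        yield from get_tags(i, params)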

html = """
 <table>
  <tr style='align:top'>
    <td style='width:40%'></td>
    <td style='align:top' class='ignore_this'></td>
 </tr>
 </table>
"""
print(list(get_tags(soup(html, 'html.parser'), {'td':{'style':'width:40%'}, 'tr':{'style':'align:top'}})))

Output:

[<tr style="align:top">
  <td style="width:40%"></td>
  <td class="ignore_this" style="align:top"></td>
 </tr>, <td style="width:40%"></td>]

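The recursive function lets you supply your own dictionary of the target attributes wanted for particular tags: the solution tries to match any of the specified attributes against the bs4 object passed to the function, and if a match is found, the element is yielded.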
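
If every listed attribute must match rather than any one of them, the same dictionary-driven idea can be sketched without explicit recursion by building a filter function for find_all. This is an illustrative variant, not the answer author's code: make_filter, tag_filter, spec, and the sample markup are hypothetical names and data.

```python
from bs4 import BeautifulSoup

def make_filter(params):
    """Build a find_all filter from {tag_name: {attribute: value}}."""
    def tag_filter(tag):
        wanted = params.get(tag.name)
        if not wanted:
            return False
        for attr, value in wanted.items():
            actual = tag.get(attr)
            if isinstance(actual, list):
                # Multi-valued attributes such as class are parsed into a list.
                if value not in actual:
                    return False
            elif actual != value:
                return False
        return True
    return tag_filter

# Hypothetical sample markup.
html = """
<table>
  <tr style="align:top">
    <td style="width:40%" class="get_this extra"></td>
    <td style="width:40%" class="ignore_this"></td>
  </tr>
</table>
"""
spec = {"td": {"style": "width:40%", "class": "get_this"},
        "tr": {"style": "align:top"}}
matches = BeautifulSoup(html, "html.parser").find_all(make_filter(spec))
match_names = [tag.name for tag in matches]
print(match_names)
```

The design difference from the recursive generator is that a td must satisfy both its style and its class requirement here, so the second td is excluded even though its style matches.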

Answer 2

Tar*_*pta 1

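Assuming bsObj is your BeautifulSoup object, try:

# find_all returns a list-like ResultSet, so search each matching row separately
tr = bsObj.find_all('tr', {'valign': 'top'})
td = [cell for row in tr for cell in row.find_all('td', {'width': '40%'})]

Hope this helps.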
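
If the goal is the td cells that sit inside matching tr rows, as the two-step lookup above suggests, a CSS selector passed to select() may express that in one call. This is a sketch assuming a Beautiful Soup installation whose select() supports attribute selectors; the sample markup and the names cells and texts are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup using the attributes named in the question.
html = """
<table>
  <tr valign="top"><td width="40%">keep</td><td width="60%">skip</td></tr>
  <tr valign="middle"><td width="40%">skip too</td></tr>
</table>
"""
bsObj = BeautifulSoup(html, "html.parser")

# Descendant combinator: td[width="40%"] somewhere inside tr[valign="top"].
cells = bsObj.select('tr[valign="top"] td[width="40%"]')
texts = [cell.get_text() for cell in cells]
print(texts)
```

Only cells that satisfy the td condition and have an ancestor row satisfying the tr condition are returned, which is one reading of the "and" asked for in the question.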
