Find an HTML tag (h2) containing text using bs4

Question

Mar*_*ary 1 html python beautifulsoup html-parsing

Given this fragment of HTML:

html3= """<a name="definition"> </a>
<h2><span class="sectioncount">3.342.2323</span> Content Logical Definition <a title="link to here" class="self-link" href="valueset-investigation"><img src="ta.png"/></a></h2>
<hr/>
<div><p from the following </p><ul><li>Include these codes as defined in http://snomed.info/sct<table><tr><td><b>Code</b></td><td><b>Display</b></td></tr><tr><td>34353553</td><td>Examination / signs</td><td/></tr><tr><td>35453453453</td><td>History/symptoms</td><td/></tr></table></li></ul></div>
<p> </p>"""

I want to use BeautifulSoup to find the h2 whose text equals "Content Logical Definition", along with its next siblings. But BeautifulSoup cannot find the h2. Here is my code:

soup = BeautifulSoup(html3, "lxml")
f= soup.find("h2", text = "Content Logical Definition").nextsibilings

It raises this error:

AttributeError: 'NoneType' object has no attribute 'nextsibilings'

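The document contains several h2 elements, and the only thing that makes this one unique is the text "Content Logical Definition". Once I have found this h2, I want to extract the data from the table and the list below it.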

Answer 1

ale*_*cxe 5

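The main problem is the way you locate the h2 element whose siblings you want. I would use a function that checks whether "Content Logical Definition" appears in the element's text: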

soup.find(lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.text)

Also, to get the next siblings, you should use .next_siblings and not nextsibilings.
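To make the two points above concrete, here is a minimal sketch of why the original call ends up with None. It assumes a trimmed-down version of the question's h2; the `html` string and the use of the built-in `html.parser` (instead of lxml) are illustrative choices, not part of the original post.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the question's markup (illustrative only).
html = (
    '<h2><span class="sectioncount">3.342.2323</span> '
    'Content Logical Definition <a class="self-link" href="#">link</a></h2>'
    '<hr/><p>after</p>'
)
soup = BeautifulSoup(html, "html.parser")
h2 = soup.find("h2")

# The h2 has three child nodes (span, text, a), so .string is None and
# a string filter on the h2 itself has nothing to compare against.
string_is_none = h2.string is None
exact_match = soup.find("h2", string="Content Logical Definition")

# get_text() concatenates the text of all descendants, so a substring
# test inside a function filter can still find the heading.
by_function = soup.find(
    lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.get_text()
)

# An unknown attribute on a Tag is treated as a child-tag lookup,
# so the misspelled name quietly evaluates to None instead of raising.
misspelled = h2.nextsibilings

# find_next_siblings() returns only the following sibling tags.
sibling_names = [s.name for s in h2.find_next_siblings()]
```

In the question, the AttributeError comes from the first step: find() returned None because the string filter could not match, and the attribute access then failed on None.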

Demo:

>>> from bs4 import BeautifulSoup
>>> html3= """<a name="definition"> </a>
... <h2><span class="sectioncount">3.342.2323</span> Content Logical Definition <a title="link to here" class="self-link" href="valueset-investigation"><img src="ta.png"/></a></h2>
... <hr/>
... <div><p from the following </p><ul><li>Include these codes as defined in http://snomed.info/sct<table><tr><td><b>Code</b></td><td><b>Display</b></td></tr><tr><td>34353553</td><td>Examination / signs</td><td/></tr><tr><td>35453453453</td><td>History/symptoms</td><td/></tr></table></li></ul></div>
... <p> </p>"""
>>> soup = BeautifulSoup(html3, "lxml")
>>> h2 = soup.find(lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.text)
>>> for sibling in h2.next_siblings:
...     print(sibling)
... 
<hr/>
<div><p following="" from="" the=""></p><ul><li>Include these codes as defined in http://snomed.info/sct<table><tr><td><b>Code</b></td><td><b>Display</b></td></tr><tr><td>34353553</td><td>Examination / signs</td><td></td></tr><tr><td>35453453453</td><td>History/symptoms</td><td></td></tr></table></li></ul></div>
<p> </p>

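However, now that I know the real HTML you are dealing with and how messy it is, I think you should iterate over the siblings and stop at the next h2, or earlier if you find a table. A working implementation: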

import requests
from bs4 import BeautifulSoup

urls = [
    'https://www.hl7.org/fhir/valueset-activity-reason.html',
    'https://www.hl7.org/fhir/valueset-age-units.html'
]

for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')

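    # Locate the heading by substring, because the h2 has several child nodes.
    h2 = soup.find(lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.text)
    table = None
    # Guard against pages where the heading is missing.
    if h2 is not None:
        # Walk the following sibling tags until a table or the next h2 appears.
        for sibling in h2.find_next_siblings():
            if sibling.name == "table":
                table = sibling
                break
            if sibling.name == "h2":
                break
    print(table)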
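In the sample HTML from the question, the table is nested inside a div rather than being a direct sibling of the h2, so a variation on the loop above may be useful. The following is a sketch under that assumption: `table_after_heading` and `table_rows` are hypothetical helper names, the `html` string is a shortened stand-in for the real page, and `html.parser` is used only to keep the example dependency-free.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's markup (illustrative only).
html = """<h2><span>3.1</span> Content Logical Definition</h2>
<hr/>
<div><ul><li>Include these codes<table>
<tr><td><b>Code</b></td><td><b>Display</b></td></tr>
<tr><td>34353553</td><td>Examination / signs</td><td/></tr>
<tr><td>35453453453</td><td>History/symptoms</td><td/></tr>
</table></li></ul></div>
<h2>Next section</h2>
<table><tr><td>unrelated</td></tr></table>"""


def table_after_heading(soup, heading_text):
    """Return the first table after the matching h2, or None (hypothetical helper)."""
    h2 = soup.find(lambda elm: elm.name == "h2" and heading_text in elm.get_text())
    if h2 is None:
        return None
    for sibling in h2.find_next_siblings():
        if sibling.name == "h2":
            # Reached the next section without finding a table.
            return None
        if sibling.name == "table":
            return sibling
        # The table may be nested inside the sibling (div > ul > li > table).
        nested = sibling.find("table")
        if nested is not None:
            return nested
    return None


def table_rows(table):
    """Collect the non-empty cell texts of each row (hypothetical helper)."""
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        rows.append([c for c in cells if c])
    return rows


soup = BeautifulSoup(html, "html.parser")
table = table_after_heading(soup, "Content Logical Definition")
rows = table_rows(table) if table is not None else []
```

Stopping at the next h2 keeps the search inside the section of interest, so a table that belongs to a later heading is not picked up by mistake.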

Archived:	9 years, 12 months ago
Views:	3638
Last activity:	9 years, 11 months ago