Extracting text under a specific heading with BeautifulSoup

Question

th3*_*33n 5 python beautifulsoup web-scraping

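How do I extract all the text under a specific heading? In this case, I need to extract the text under Topic 2.

Edit: On other web pages, "Topic 2" sometimes appears as the third heading or the first heading. "Topic 2" is not always in the same position, and it does not always have the same ID number.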

# import library
from bs4 import BeautifulSoup

# dummy webpage text
body = '''
<h2 id="1">Topic 1</h2>
<p> This is the first sentence.</p>
<p> This is the second sentence.</p>
<p> This is the third sentence.</p>

<h2 id="2">Topic 2</h2>
<p> This is the fourth sentence.</p>
<p> This is the fifth sentence.</p>

<h2 id="3">Topic 3</h2>
<p> This is the sixth sentence.</p>
<p> This is the seventh sentence.</p>
<p> This is the eighth sentence.</p>
'''

# convert text to soup 
soup = BeautifulSoup(body, 'lxml')

If I extract only the text under "Topic 2", this is the output I expect:

This is the fourth sentence. This is the fifth sentence.

What I have tried:

I tried soup.select('h2 + p'), but that only returns the first sentence under each heading.

[<p> This is the first sentence.</p>,
 <p> This is the fourth sentence.</p>,
 <p> This is the sixth sentence.</p>]
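
The result above follows from how the `+` combinator works: it matches only the element immediately after each `h2`. The sketch below contrasts it with the general sibling combinator `~` on a shortened copy of the page. It assumes a Beautiful Soup version with CSS selector support and uses the built-in `html.parser` instead of `lxml` so that it has no extra dependency; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample page (two headings only)
html = '''
<h2 id="1">Topic 1</h2>
<p> This is the first sentence.</p>
<p> This is the second sentence.</p>
<h2 id="2">Topic 2</h2>
<p> This is the fourth sentence.</p>
<p> This is the fifth sentence.</p>
'''

soup = BeautifulSoup(html, "html.parser")

# "+" matches only the <p> that directly follows an <h2>
adjacent = soup.select("h2 + p")

# "~" matches every <p> that comes after an <h2> with the same parent
general = soup.select("h2 ~ p")

print(len(adjacent), len(general))
```

Neither combinator stops at the next heading, so a selector of this kind cannot by itself isolate the paragraphs of a single topic.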

I also tried the following, but it returns all of the text, when I only need the text under Topic 2:

import pandas as pd 

lst = []
for row in soup.find_all('p'):
    text_dict = {}
    text_dict['text'] = row.text
    lst.append(text_dict)

df = pd.DataFrame(lst) 

df

|   | text                          |
|---|-------------------------------|
| 0 | This is the first sentence.   |
| 1 | This is the second sentence.  |
| 2 | This is the third sentence.   |
| 3 | This is the fourth sentence.  |
| 4 | This is the fifth sentence.   |
| 5 | This is the sixth sentence.   |
| 6 | This is the seventh sentence. |
| 7 | This is the eighth sentence.  |
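
The loop above collects every `<p>` without recording which heading it belongs to. One possible variation, sketched here under the assumption that headings are always `h2` and paragraphs always `p`, walks both tag types in document order and groups each paragraph under the most recent heading. The names `sections` and `current` are hypothetical, and `html.parser` is used only to keep the example dependency-free.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample page
html = '''
<h2 id="1">Topic 1</h2>
<p> This is the first sentence.</p>
<h2 id="2">Topic 2</h2>
<p> This is the fourth sentence.</p>
<p> This is the fifth sentence.</p>
<h2 id="3">Topic 3</h2>
<p> This is the sixth sentence.</p>
'''

soup = BeautifulSoup(html, "html.parser")

sections = {}   # heading text -> list of paragraph texts
current = None  # heading seen most recently

# find_all with a list of names returns matches in document order
for el in soup.find_all(["h2", "p"]):
    if el.name == "h2":
        current = el.get_text(strip=True)
        sections[current] = []
    elif current is not None:
        sections[current].append(el.get_text(strip=True))

print(sections["Topic 2"])
```

Because the lookup is by heading text, it does not depend on the position or `id` of "Topic 2".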

Answer 1

Jac*_*ing 4

Try:

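# Locate the heading by its text; string= is the current name of the older text= argument
target = soup.find('h2', string='Topic 2')
# Walk the tags that follow the heading and stop at the next <h2>
for sib in target.find_next_siblings():
    if sib.name == "h2":
        break
    print(sib.text)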

Output (from the HTML above):

 This is the fourth sentence.
 This is the fifth sentence.
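
The same idea can be wrapped in a small helper that returns the text as one string, in the form the question asks for. This is a minimal sketch, not part of the original answer: the function name `text_under_heading` and its parameters are hypothetical, the heading is matched after stripping whitespace, and `html.parser` stands in for `lxml`. The sample gives "Topic 2" a different `id` to show that the lookup does not rely on it.

```python
from bs4 import BeautifulSoup

html = '''
<h2 id="1">Topic 1</h2>
<p> This is the first sentence.</p>
<h2 id="7">Topic 2</h2>
<p> This is the fourth sentence.</p>
<p> This is the fifth sentence.</p>
<h2 id="3">Topic 3</h2>
<p> This is the sixth sentence.</p>
'''

def text_under_heading(soup, heading_text, tag="h2"):
    """Join the text of the tags between a heading and the next heading of the same kind."""
    # Match on the stripped heading text; tags without a single string give None
    heading = soup.find(
        tag, string=lambda s: s is not None and s.strip() == heading_text
    )
    if heading is None:
        return None  # heading not present on this page
    parts = []
    for sib in heading.find_next_siblings():
        if sib.name == tag:
            break  # reached the next section
        parts.append(sib.get_text(strip=True))
    return " ".join(parts)

soup = BeautifulSoup(html, "html.parser")
result = text_under_heading(soup, "Topic 2")
print(result)
```

Returning `None` for a missing heading lets the caller skip pages that have no "Topic 2" instead of raising an `AttributeError`.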
