th3*_*33n 5 python beautifulsoup web-scraping
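How do I extract all of the text below a specific heading? In this case, I need to extract the text under Topic 2. Edit: on other web pages, "Topic 2" sometimes appears as the third heading or the first heading. "Topic 2" is not always in the same position, and it does not always have the same ID number.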
# import library
from bs4 import BeautifulSoup
# dummy webpage text
body = '''
<h2 id="1">Topic 1</h2>
<p> This is the first sentence.</p>
<p> This is the second sentence.</p>
<p> This is the third sentence.</p>
<h2 id="2">Topic 2</h2>
<p> This is the fourth sentence.</p>
<p> This is the fifth sentence.</p>
<h2 id="3">Topic 3</h2>
<p> This is the sixth sentence.</p>
<p> This is the seventh sentence.</p>
<p> This is the eighth sentence.</p>
'''
# convert text to soup
soup = BeautifulSoup(body, 'lxml')
If I extract only the text under "Topic 2", this should be my output:
This is the fourth sentence. This is the fifth sentence.
My attempts to solve this:
I tried soup.select('h2 + p'), but that only gives me the first sentence under each heading.
[<p> This is the first sentence.</p>,
<p> This is the fourth sentence.</p>,
<p> This is the sixth sentence.</p>]
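The result above follows from how the selector works: `+` is the CSS adjacent-sibling combinator, so `h2 + p` matches only a `<p>` that comes immediately after an `<h2>`. The sketch below is an illustration on a made-up snippet (the names `demo_html` and `demo_soup` are hypothetical, and it assumes the built-in `html.parser` backend). It also shows that the general-sibling combinator `~` goes too far the other way, so neither selector isolates a single section by itself.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page, not the one from the question.
demo_html = '<div><h2>A</h2><p>one</p><p>two</p><h2>B</h2><p>three</p></div>'
demo_soup = BeautifulSoup(demo_html, 'html.parser')

# 'h2 + p': only the <p> directly after each <h2>.
adjacent = [p.get_text() for p in demo_soup.select('h2 + p')]

# 'h2 ~ p': every <p> that has any earlier <h2> sibling, across all sections.
following = [p.get_text() for p in demo_soup.select('h2 ~ p')]

print(adjacent)
print(following)
```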
I also tried this, but it gives me all of the text when I only need the text under Topic 2:
import pandas as pd
lst = []
for row in soup.find_all('p'):
    text_dict = {}
    text_dict['text'] = row.text
    lst.append(text_dict)
df = pd.DataFrame(lst)
df
| | text |
|---|-------------------------------|
| 0 | This is the first sentence. |
| 1 | This is the second sentence. |
| 2 | This is the third sentence. |
| 3 | This is the fourth sentence. |
| 4 | This is the fifth sentence. |
| 5 | This is the sixth sentence. |
| 6 | This is the seventh sentence. |
| 7 | This is the eighth sentence. |
Try:
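# locate the heading by its text (newer Beautiful Soup versions name this argument string=)
target = soup.find('h2',text='Topic 2')
# walk the tags that follow the heading at the same level
for sib in target.find_next_siblings():
    # the next <h2> marks the start of the next topic, so stop there
    if sib.name=="h2":
        break
    else:
        print(sib.text)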
Output (from the HTML above):
This is the fourth sentence.
This is the fifth sentence.
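Since the question says the heading can move around and its id can change, the same sibling walk can be wrapped in a small helper. The following is only a sketch under a few assumptions: the function name `text_under_heading`, its parameters, and the sample markup are hypothetical, the heading text is compared after stripping whitespace, a missing heading yields an empty list, and the built-in `html.parser` backend is used.

```python
from bs4 import BeautifulSoup

def text_under_heading(soup, heading_text, heading_tag='h2'):
    """Collect the text of each tag between the matching heading and the next one."""
    # Match on tag name and stripped text, so the id and position do not matter.
    target = soup.find(
        lambda tag: tag.name == heading_tag
        and tag.get_text(strip=True) == heading_text
    )
    if target is None:
        return []  # heading not present on this page
    texts = []
    for sib in target.find_next_siblings():
        if sib.name == heading_tag:
            break  # reached the next section
        texts.append(sib.get_text(strip=True))
    return texts

# Hypothetical page where "Topic 2" comes first and has a different id.
sample = '''
<h2 id="7">Topic 2</h2>
<p> This is the fourth sentence.</p>
<p> This is the fifth sentence.</p>
<h2 id="9">Topic 3</h2>
<p> This is the sixth sentence.</p>
'''
sample_soup = BeautifulSoup(sample, 'html.parser')
result = text_under_heading(sample_soup, 'Topic 2')
print(' '.join(result))
```

Returning a list rather than printing keeps the helper easy to reuse, for example to join the sentences into one string or load them into a DataFrame.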