Shi*_*ith 2 html python beautifulsoup python-3.x
我正在尝试从ABBV 10-k sec 文件中的一个表格中提取列标题(第 25 页上的“发行人购买股权证券”表格-图表下方。)
<td>列标题标签中的内部标签<tr>,文本位于单独的<div>标签中,如下例所示
<tr>
<td>
<div>string1</div>
<div>string2</div>
<div>string3</div>
</td>
</tr>
Run Code Online (Sandbox Code Playgroud)
当尝试从标签中提取所有文本时,文本之间没有空格分隔(例如,对于上述 html 输出将是string1string3string3预期的string1 string3 string3)。
使用下面的代码从表中提取列标题
url = 'https://www.sec.gov/Archives/edgar/data/1551152/000155115218000014/abbv-20171231x10k.htm'
htmlpage = requests.get(url)
soup = BeautifulSoup(htmlpage.text, "lxml")
table = soup.find_all('table')[76]
rows = table.find_all('tr')
table_data = []
for tr in rows[2:3]:
row_data=[]
cells = tr.find_all(['td', 'th'], recursive=False)
for cell in cells[1:4]:
row_data.append(cell.text.encode('utf-8'))
table_data.append([x.decode('utf-8').strip() for x in row_data])
print(table_data)
Run Code Online (Sandbox Code Playgroud)
输出:
[['(a) TotalNumberof Shares(or Units)Purchased', '', '(b) AveragePricePaid per Share(or Unit)']]预期输出:(
[['(a) Total Number of Shares (or Units) Purchased', '', '(b) Average Price Paid per Share (or Unit)']]每个单词分隔一个空格)
使用separator参数.get_text():
html = '''<tr>
<td>
<div>string1</div>
<div>string2</div>
<div>string3</div>
</td>
</tr>'''
import bs4
soup = bs4.BeautifulSoup(html, 'html.parser')
td = soup.find('td')
td.get_text(separator=' ')
Run Code Online (Sandbox Code Playgroud)
这是您的代码的外观:
from bs4 import BeautifulSoup
import requests
url = 'https://www.sec.gov/Archives/edgar/data/1551152/000155115218000014/abbv-20171231x10k.htm'
htmlpage = requests.get(url)
soup = BeautifulSoup(htmlpage.text, "lxml")
table = soup.find_all('table')[76]
rows = table.find_all('tr')
table_data = []
for tr in rows[2:3]:
row_data=[]
cells = tr.find_all(['td', 'th'], recursive=False)
for cell in cells[1:4]:
row_data.append(cell.get_text(separator=' ').encode('utf-8'))
table_data.append([x.decode('utf-8').strip() for x in row_data])
print(table_data)
Run Code Online (Sandbox Code Playgroud)
输出:
print(table_data)
[['(a) Total Number of Shares (or Units) Purchased', '', '(b) Average Price Paid per Share (or Unit)']]
Run Code Online (Sandbox Code Playgroud)