Question by For*_*rge (tags: python, beautifulsoup, pandas)
I'm scraping a web page to get a table, using BeautifulSoup and pandas. One of the columns contains URLs. When I pass the HTML to pandas, the hrefs are lost.
Is there any way to preserve the URL links in that column?
Sample data (edited for a clearer example):
<html>
  <body>
    <table>
      <tr>
        <td>customer</td>
        <td>country</td>
        <td>area</td>
        <td>website link</td>
      </tr>
      <tr>
        <td>IBM</td>
        <td>USA</td>
        <td>EMEA</td>
        <td><a href="http://www.ibm.com">IBM site</a></td>
      </tr>
      <tr>
        <td>CISCO</td>
        <td>USA</td>
        <td>EMEA</td>
        <td><a href="http://www.cisco.com">cisco site</a></td>
      </tr>
      <tr>
        <td>unknown company</td>
        <td>USA</td>
        <td>EMEA</td>
        <td></td>
      </tr>
    </table>
  </body>
</html>
My Python code:
import pandas as pd
from bs4 import BeautifulSoup

file = open(url, "r")   # url is the path of the saved HTML page
soup = BeautifulSoup(file, 'lxml')
parsed_table = soup.find_all('table')[0]   # the sample page has a single table
df = pd.read_html(str(parsed_table), encoding='utf-8')[0]
df
Output (exported to CSV):
customer;country;area;website link
IBM;USA;EMEA;IBM site
CISCO;USA;EMEA;cisco site
unknown company;USA;EMEA;
The df output is fine, but the links are lost. I need to keep the links, or at least the URLs.
Any hints?
Answer by unu*_*tbu
pd.read_html assumes the data you're interested in is in the text, not in the tag attributes. However, it's not hard to scrape the table yourself:
import bs4 as bs
import pandas as pd

with open(url, "r") as f:
    soup = bs.BeautifulSoup(f, 'lxml')

# first (and only) table in the sample HTML
parsed_table = soup.find_all('table')[0]

# take the href when a cell contains a link, otherwise the cell's text
data = [[td.a['href'] if td.find('a') else
         ''.join(td.stripped_strings)
         for td in row.find_all('td')]
        for row in parsed_table.find_all('tr')]

# the first row holds the column names
df = pd.DataFrame(data[1:], columns=data[0])
print(df)
This yields:
          customer  country  area          website link
0              IBM      USA  EMEA    http://www.ibm.com
1            CISCO      USA  EMEA  http://www.cisco.com
2  unknown company      USA  EMEA
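Newer pandas can also handle this natively: since pandas 1.5, read_html accepts an extract_links argument, which makes the manual loop above optional. A minimal sketch, assuming pandas >= 1.5 and the sample page saved at url:

from io import StringIO
import pandas as pd

with open(url, "r") as f:
    # StringIO keeps pandas 2.x from warning about literal HTML input
    tables = pd.read_html(StringIO(f.read()), extract_links="body")

df = tables[0]
print(df)   # each body cell is now a (text, link) tuple; link is None without <a>

From there, mapping a column of (text, link) tuples down to just the URLs is a one-line list comprehension.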
Alternatively, just check whether the tag exists, like this:
import bs4 as bs
import numpy as np
import pandas as pd

with open(url, "r") as f:
    sp = bs.BeautifulSoup(f, 'lxml')

tb = sp.find_all('table')[0]   # pick the table you need from the page
df = pd.read_html(str(tb), encoding='utf-8', header=0)[0]

# assumes one <a> tag per data row (see the note below)
df['href'] = [np.where(tag.has_attr('href'), tag.get('href'), "no link")
              for tag in tb.find_all('a')]
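Note that this assignment assumes exactly one <a> tag per data row: tb.find_all('a') returns one entry per link, not per row, so a row without a link (like the "unknown company" row in the sample) leaves the list shorter than the frame and the assignment fails. A row-wise sketch of the same idea, reusing tb and df from above, avoids that assumption:

# walk the body rows instead of the bare <a> tags so the list stays
# aligned with the frame even when a row has no link
rows = tb.find_all('tr')[1:]            # skip the header row
links = [row.find('a') for row in rows] # None where a row has no link
df['href'] = [a['href'] if a else "no link" for a in links]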