Leo*_*res 5 python beautifulsoup python-3.x
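The following code fetches player data, but each dataset is different. The first data it sees is the quarterback data, so it uses those columns for all the data that follows. How can I change the headers so that, for each different dataset it encounters, the correct headers are used with the correct data?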
import pandas as pd
import csv
from pprint import pprint
from bs4 import BeautifulSoup
import requests
# Create the page object
url = 'https://www.espn.com/nfl/boxscore/_/gameId/401326313'
soup = BeautifulSoup(requests.get(url).content, "html.parser")
rows = soup.select("table.mod-data tr")
#rows = soup.find_all("table.mod-data tr")
# NOTE: headers come from the first row only, so every later table reuses the quarterback columns
headers = [header.get_text(strip=True) for header in rows[0].find_all("th")]
data = [dict(zip(headers, [cell.get_text(strip=True) for cell in row.find_all("td")]))
        for row in rows[1:]]
df = pd.DataFrame(data)
df.to_csv('_Data_{}.csv'.format(pd.Timestamp.now().strftime("%Y-%m-%d %H%M%S")), index=False)
# see what the data looks like at this point
pprint(data)
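As mentioned, the expected result is not entirely clear, but if you just want to read the tables, use pandas.read_html. Passing index_col=0 keeps the first column, which has no header, from being named Unnamed: 0.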
pd.read_html('https://www.espn.com/nfl/boxscore/_/gameId/401326313',index_col=0)
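To see why read_html sidesteps the header problem, here is a minimal sketch that does not touch the network. The HTML string is a made-up stand-in for the box score page (player names and numbers are invented), and it assumes an HTML parser that read_html can use, such as lxml, is installed. The point is only that read_html returns one DataFrame per table element, each with its own columns.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the box score page: two tables with different columns.
HTML = """
<table>
  <thead><tr><th></th><th>C/ATT</th><th>YDS</th></tr></thead>
  <tbody><tr><td>Player A</td><td>20/30</td><td>250</td></tr></tbody>
</table>
<table>
  <thead><tr><th></th><th>CAR</th><th>YDS</th><th>TD</th></tr></thead>
  <tbody><tr><td>Player B</td><td>12</td><td>64</td><td>1</td></tr></tbody>
</table>
"""

# read_html returns a list with one DataFrame per <table>;
# index_col=0 turns the header-less first column into the index.
tables = pd.read_html(StringIO(HTML), index_col=0)

for table in tables:
    # Each table carries its own header row, so the columns differ per table.
    print(list(table.columns))
```

Because every table is parsed on its own, there is no shared header list that the first table could impose on the others.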
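import pandas as pd
# Write every table on the page to its own timestamped CSV file
for table in pd.read_html('https://www.espn.com/nfl/boxscore/_/gameId/401326313', index_col=0):
    # pd.datetime was removed in pandas 2.0, so use pd.Timestamp instead
    pd.DataFrame(table).to_csv('_Data_{}.csv'.format(pd.Timestamp.now().strftime("%Y-%m-%d %H%M%S.%f")))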
As an alternative, you can call reset_index() and use to_csv(index=False):
pd.DataFrame(table).rename_axis('').reset_index().to_csv('_Data_{}.csv'.format(pd.Timestamp.now().strftime("%Y-%m-%d %H%M%S.%f")), index=False)
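What rename_axis('').reset_index() does to a table can be checked on a small hand-built frame. In this sketch the DataFrame is a hypothetical stand-in for one table returned by read_html(..., index_col=0), with invented values; to_csv() is called without a path so that it returns the CSV text instead of writing a file.

```python
import pandas as pd

# Hypothetical table shaped like one returned by read_html(..., index_col=0):
# the player names sit in an unnamed index.
table = pd.DataFrame({"CAR": [12], "YDS": [64]}, index=["Player B"])

# Variant 1: keep the index and let to_csv write it as the first column.
with_index = table.to_csv()

# Variant 2: give the index an empty name, move it into a regular column,
# and write the frame without an index.
flat_df = table.rename_axis("").reset_index()
flat = flat_df.to_csv(index=False)

print(list(flat_df.columns))
print(with_index)
print(flat)
```

Printing both strings side by side is a quick way to compare what each variant would write to disk.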
Using the caption of each table and storing the results in named csv files:
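from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.espn.com/nfl/boxscore/_/gameId/401326313'
soup = BeautifulSoup(requests.get(url).content, "html.parser")
for table in soup.select('article.boxscore-tabs table'):
    # Take the caption text from the table's parent and join its words with underscores
    caption = '_'.join(table.parent.select_one('.table-caption').text.split(' '))
    # Wrap the markup in StringIO: newer pandas deprecates passing literal HTML strings
    df = pd.read_html(StringIO(table.prettify()), index_col=0)[0].rename_axis('').reset_index()
    df.insert(0, 'caption', caption)
    # pd.datetime was removed in pandas 2.0, so use pd.Timestamp instead
    df.to_csv(f'_DATA_{caption}_{pd.Timestamp.now().strftime("%Y-%m-%d %H%M%S")}.csv', index=False)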
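If you would rather stay with plain BeautifulSoup, as in the question, the same idea can be sketched by reading the header row of each table separately instead of reusing rows[0] for everything. The markup below is invented and only modeled on the selectors used above (table.mod-data, .table-caption), so the real page may be structured differently; the "name" label for the empty header cell is an arbitrary placeholder.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each table sits next to a caption inside a shared parent.
HTML = """
<div><div class="table-caption">Team Passing</div>
<table class="mod-data">
  <tr><th></th><th>C/ATT</th><th>YDS</th></tr>
  <tr><td>Player A</td><td>20/30</td><td>250</td></tr>
</table></div>
<div><div class="table-caption">Team Rushing</div>
<table class="mod-data">
  <tr><th></th><th>CAR</th><th>YDS</th><th>TD</th></tr>
  <tr><td>Player B</td><td>12</td><td>64</td><td>1</td></tr>
</table></div>
"""

soup = BeautifulSoup(HTML, "html.parser")
datasets = {}

for table in soup.select("table.mod-data"):
    caption = table.parent.select_one(".table-caption").get_text(strip=True)
    rows = table.select("tr")
    # Headers are taken from this table's own first row, not from the first table.
    # "name" is a placeholder label for the header-less first column.
    headers = [th.get_text(strip=True) or "name" for th in rows[0].find_all("th")]
    records = []
    for row in rows[1:]:
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if cells:  # skip rows that contain no data cells
            records.append(dict(zip(headers, cells)))
    datasets[caption] = records

for caption, records in datasets.items():
    print(caption, records)
```

Keeping one list of records per caption means each dataset can later be turned into its own DataFrame or CSV file with matching columns.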