Python Beautiful Soup: getting the correct column headers for each table

Question

Leo*_*res 5 python beautifulsoup python-3.x

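The following code scrapes player data, but each data set is different. The first data it sees is the quarterback data, so it uses those columns for all the data that follows. How can I change the headers so that, for each different data set it encounters, the correct headers are used with the correct data?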

import pandas as pd
import csv
from pprint import pprint

from bs4 import BeautifulSoup
import requests

url = 'https://www.espn.com/nfl/boxscore/_/gameId/401326313'
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# All rows from every stats table on the page, flattened into one list
rows = soup.select("table.mod-data tr")
#rows = soup.find_all("table.mod-data tr")
# Headers come from the first row only, so every later table reuses them
headers = [header.get_text(strip=True) for header in rows[0].find_all("th")]

data = [dict(zip(headers, [cell.get_text(strip=True) for cell in row.find_all("td")]))
        for row in rows[1:]]

df = pd.DataFrame(data)
df.to_csv('_Data_{}.csv'.format(pd.Timestamp.now().strftime("%Y-%m-%d %H%M%S")), index=False)

# see what the data looks like at this point
pprint(data)

Answer 1

Hed*_*Hog 1

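As already mentioned, the expected result is not entirely clear, but if you just want to read the tables, pandas.read_html will do it. Passing index_col=0 keeps the first column, which has no header, from being named Unnamed: 0.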

pd.read_html('https://www.espn.com/nfl/boxscore/_/gameId/401326313',index_col=0)

Example

import pandas as pd

for table in pd.read_html('https://www.espn.com/nfl/boxscore/_/gameId/401326313', index_col=0):
    # Microseconds (%f) in the timestamp keep the file names distinct
    table.to_csv('_Data_{}.csv'.format(pd.Timestamp.today().strftime("%Y-%m-%d %H%M%S.%f")))

As an alternative, you can call reset_index() and use to_csv(index=False):

table.rename_axis('').reset_index().to_csv('_Data_{}.csv'.format(pd.Timestamp.today().strftime("%Y-%m-%d %H%M%S.%f")), index=False)

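To see what the reset_index() variant does to the header row, here is a minimal sketch on a hand-built DataFrame. The player name and values are made up for illustration and only stand in for one table returned by pd.read_html(..., index_col=0).

```python
import pandas as pd

# Hypothetical stand-in for one table from pd.read_html(..., index_col=0):
# the player names form an unnamed index.
table = pd.DataFrame(
    {"C/ATT": ["21/32"], "YDS": [256]},
    index=["Player One"],
)

# rename_axis('') gives the index an empty name, so reset_index() produces
# a column with a blank header instead of one called "index".
flat = table.rename_axis("").reset_index()

# With no path given, to_csv returns the CSV text instead of writing a file.
csv_text = flat.to_csv(index=False)
print(csv_text)
```

Without rename_axis(''), the first column of the CSV would be labelled index, which is probably not wanted for a column of player names.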
Edit

Using the caption of each table and storing the results in named csv files:

from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.espn.com/nfl/boxscore/_/gameId/401326313'
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for table in soup.select('article.boxscore-tabs table'):
    # The caption lives in the table's parent element; join its words with underscores
    caption = '_'.join(table.parent.select_one('.table-caption').text.split())
    # Parse only this table, so it keeps its own header row
    # (StringIO avoids the literal-HTML deprecation in newer pandas)
    df = pd.read_html(StringIO(table.prettify()), index_col=0)[0].rename_axis('').reset_index()
    df.insert(0, 'caption', caption)
    df.to_csv(f'_DATA_{caption}_{pd.Timestamp.now().strftime("%Y-%m-%d %H%M%S")}.csv', index=False)

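If you would rather stay with plain BeautifulSoup, as in the question, the same idea can be sketched by reading the header row inside the loop over tables instead of once for the whole page. This is only a sketch under assumptions: SAMPLE_HTML is invented markup that mimics the table.mod-data and .table-caption classes used above (it is not the real ESPN page), parse_tables is a hypothetical helper, and "name" is a placeholder label for the header cell that is empty.

```python
from bs4 import BeautifulSoup

# Invented sample markup standing in for the real page (assumption).
SAMPLE_HTML = """
<div>
  <div class="table-caption">Team A Passing</div>
  <table class="mod-data">
    <tr><th></th><th>C/ATT</th><th>YDS</th></tr>
    <tr><td>Player One</td><td>21/32</td><td>256</td></tr>
  </table>
</div>
<div>
  <div class="table-caption">Team A Rushing</div>
  <table class="mod-data">
    <tr><th></th><th>CAR</th><th>YDS</th></tr>
    <tr><td>Player Two</td><td>14</td><td>61</td></tr>
  </table>
</div>
"""


def parse_tables(html):
    """Return {caption: [row dicts]}, keyed by each table's own headers."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for table in soup.select("table.mod-data"):
        caption = table.parent.select_one(".table-caption").get_text(strip=True)
        rows = table.find_all("tr")
        # Headers are read per table; "name" is a placeholder for the blank cell.
        headers = [th.get_text(strip=True) or "name" for th in rows[0].find_all("th")]
        result[caption] = [
            dict(zip(headers, (td.get_text(strip=True) for td in row.find_all("td"))))
            for row in rows[1:]
            if row.find_all("td")  # skip rows that contain no data cells
        ]
    return result


tables = parse_tables(SAMPLE_HTML)
for caption, records in tables.items():
    print(caption, records)
```

Each list of records can then be passed to pd.DataFrame separately, so the passing and rushing columns never get mixed.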