Hok*_*mon 10 python json pandas json-normalize
希望提高我的数据科学技能。我正在练习从体育网站拉取 url 数据,并且 json 文件有多个嵌套字典。我希望能够提取这些数据以在 matplotlib 等中映射我自己的排行榜的自定义形式,但是我很难将 json 转换为可行的 df。
主要网站为:https : //www.usopen.com/scoring.html
看看背景,我相信实时信息是从下面短代码中列出的链接中提取的。我正在使用 Jupyter 笔记本。我可以成功拉取数据。
但是正如您所看到的,它正在提取多个嵌套字典,这使得提取简单的数据帧变得非常困难。
只是想得到球员,比分达到标准杆,总分和回合拉。任何帮助将不胜感激,谢谢!
import pandas as pd
import urllib as ul
import json
url = "https://gripapi-static-pd.usopen.com/gripapi/leaderboard.json"
response = ul.request.urlopen(url)
data = json.loads(response.read())
print(data)
Run Code Online (Sandbox Code Playgroud)
Tre*_*ney 11
requests.get(url).json()
获取数据pandas.json_normalize
将standings
密钥解包到数据帧中roundScores
是一个字典列表
.explode
df
import requests
import pandas as pd
# load the data
df = pd.json_normalize(requests.get(url).json(), 'standings')
# explode the roundScores column
df = df.explode('roundScores').reset_index(drop=True)
# normalize the dicts in roundScores and join back to df
df = df.join(pd.json_normalize(df.roundScores), rsuffix='_rs').drop(columns=['roundScores']).reset_index(drop=True)
# display(df.head())
isRecapAvailable player.identifier player.firstName player.lastName player.image.gravity player.image.type player.image.identifier player.image.cropMode player.country.name player.country.code player.country.flag.type player.country.flag.identifier player.isAmateur toPar.value toPar.format toPar.displayValue toParToday.value toParToday.format toParToday.displayValue totalScore.value totalScore.format totalScore.displayValue position.value position.format position.displayValue holesThrough.value holesThrough.format holesThrough.displayValue liveVideo.identifier liveVideo.isLive score.value score.format score.displayValue toPar.value_rs toPar.format_rs toPar.displayValue_rs
0 True 56278 Matthew Wolff center imageCloudinary us-open/players/2020-players/Matthew_Wolff fill United States usa imageCloudinary us-open/flags/usa False -5 absolute -5 -5 absolute -5 140.0 absolute 140 1 absolute 1 10 absolute 10 NaN NaN 66 absolute 66 -4 absolute -4
1 True 56278 Matthew Wolff center imageCloudinary us-open/players/2020-players/Matthew_Wolff fill United States usa imageCloudinary us-open/flags/usa False -5 absolute -5 -5 absolute -5 140.0 absolute 140 1 absolute 1 10 absolute 10 NaN NaN 74 absolute 74 4 absolute +4
2 True 56278 Matthew Wolff center imageCloudinary us-open/players/2020-players/Matthew_Wolff fill United States usa imageCloudinary us-open/flags/usa False -5 absolute -5 -5 absolute -5 140.0 absolute 140 1 absolute 1 10 absolute 10 NaN NaN 0 absolute -5 absolute -5
3 True 34360 Patrick Reed center imageCloudinary us-open/players/2019-players/Patrick-Reed fill United States usa imageCloudinary us-open/flags/usa False -4 absolute -4 0 absolute E 136.0 absolute 136 2 absolute 2 7 absolute 7 NaN NaN 66 absolute 66 -4 absolute -4
4 True 34360 Patrick Reed center imageCloudinary us-open/players/2019-players/Patrick-Reed fill United States usa imageCloudinary us-open/flags/usa False -4 absolute -4 0 absolute E 136.0 absolute 136 2 absolute 2 7 absolute 7 NaN NaN 70 absolute 70 0 absolute E
Run Code Online (Sandbox Code Playgroud)
standings
只是下载的 JSON 中的键之一r = requests.get(url).json()
print(r)
[out]:
dict_keys(['currentRound', 'standings', 'fullLegend', 'shortLegend', 'inlineLegend', 'cutLine', 'meta'])
Run Code Online (Sandbox Code Playgroud)
小智 2
简单快速的解决方案。使用 pandas 的 JSON 规范化可能存在更好的解决方案,但这对于您的用例来说相当不错。
def func(x):
if not any(x.isnull()):
return (x['round'], x['player']['firstName'], x['player']['identifier'], x['toParToday']['value'], x['totalScore']['value'])
df = pd.DataFrame(data['standings'])
df['round'] = data['currentRound']['name']
df = df[['player', 'toPar', 'toParToday', 'totalScore', 'round']]
info = df.apply(func, axis=1)
info_df = pd.DataFrame(list(info.values), columns=['Round', 'player_name', 'pid', 'to_par_today', 'totalScore'])
info_df.head()
Run Code Online (Sandbox Code Playgroud)
归档时间: |
|
查看次数: |
644 次 |
最近记录: |