如何将嵌套的 JSON 下载到 Pandas 数据帧中

Hok*_*mon 10 python json pandas json-normalize

希望提高我的数据科学技能。我正在练习从体育网站拉取 url 数据,并且 json 文件有多个嵌套字典。我希望能够提取这些数据以在 matplotlib 等中映射我自己的排行榜的自定义形式,但是我很难将 json 转换为可行的 df。

主要网站为:https : //www.usopen.com/scoring.html

看看背景,我相信实时信息是从下面短代码中列出的链接中提取的。我正在使用 Jupyter 笔记本。我可以成功拉取数据。

但是正如您所看到的,它正在提取多个嵌套字典,这使得提取简单的数据帧变得非常困难。

只是想得到球员,比分达到标准杆,总分和回合拉。任何帮助将不胜感激,谢谢!

import pandas as pd
import urllib as ul
import json
url = "https://gripapi-static-pd.usopen.com/gripapi/leaderboard.json"
response = ul.request.urlopen(url)
data = json.loads(response.read())
print(data)
Run Code Online (Sandbox Code Playgroud)

Tre*_*ney 11

  • 使用requests.get(url).json()获取数据
  • 用于pandas.json_normalizestandings密钥解包到数据帧中
  • roundScores 是一个字典列表
    • 该列表必须扩展为 .explode
    • dicts 列必须再次规范化
  • 将规范化列连接回数据框 df
import requests
import pandas as pd

# load the data
df = pd.json_normalize(requests.get(url).json(), 'standings')

# explode the roundScores column
df = df.explode('roundScores').reset_index(drop=True)

# normalize the dicts in roundScores and join back to df
df = df.join(pd.json_normalize(df.roundScores), rsuffix='_rs').drop(columns=['roundScores']).reset_index(drop=True)

# display(df.head())
   isRecapAvailable player.identifier player.firstName player.lastName player.image.gravity player.image.type                     player.image.identifier player.image.cropMode player.country.name player.country.code player.country.flag.type player.country.flag.identifier  player.isAmateur  toPar.value toPar.format toPar.displayValue  toParToday.value toParToday.format toParToday.displayValue  totalScore.value totalScore.format totalScore.displayValue  position.value position.format position.displayValue  holesThrough.value holesThrough.format holesThrough.displayValue liveVideo.identifier liveVideo.isLive  score.value score.format score.displayValue  toPar.value_rs toPar.format_rs toPar.displayValue_rs
0              True             56278          Matthew           Wolff               center   imageCloudinary  us-open/players/2020-players/Matthew_Wolff                  fill       United States                 usa          imageCloudinary              us-open/flags/usa             False           -5     absolute                 -5                -5          absolute                      -5             140.0          absolute                     140               1        absolute                     1                  10            absolute                        10                  NaN              NaN           66     absolute                 66              -4        absolute                    -4
1              True             56278          Matthew           Wolff               center   imageCloudinary  us-open/players/2020-players/Matthew_Wolff                  fill       United States                 usa          imageCloudinary              us-open/flags/usa             False           -5     absolute                 -5                -5          absolute                      -5             140.0          absolute                     140               1        absolute                     1                  10            absolute                        10                  NaN              NaN           74     absolute                 74               4        absolute                    +4
2              True             56278          Matthew           Wolff               center   imageCloudinary  us-open/players/2020-players/Matthew_Wolff                  fill       United States                 usa          imageCloudinary              us-open/flags/usa             False           -5     absolute                 -5                -5          absolute                      -5             140.0          absolute                     140               1        absolute                     1                  10            absolute                        10                  NaN              NaN            0     absolute                                 -5        absolute                    -5
3              True             34360          Patrick            Reed               center   imageCloudinary   us-open/players/2019-players/Patrick-Reed                  fill       United States                 usa          imageCloudinary              us-open/flags/usa             False           -4     absolute                 -4                 0          absolute                       E             136.0          absolute                     136               2        absolute                     2                   7            absolute                         7                  NaN              NaN           66     absolute                 66              -4        absolute                    -4
4              True             34360          Patrick            Reed               center   imageCloudinary   us-open/players/2019-players/Patrick-Reed                  fill       United States                 usa          imageCloudinary              us-open/flags/usa             False           -4     absolute                 -4                 0          absolute                       E             136.0          absolute                     136               2        absolute                     2                   7            absolute                         7                  NaN              NaN           70     absolute                 70               0        absolute                     E
Run Code Online (Sandbox Code Playgroud)

附加键

  • standings 只是下载的 JSON 中的键之一
r = requests.get(url).json()

print(r)
[out]:
dict_keys(['currentRound', 'standings', 'fullLegend', 'shortLegend', 'inlineLegend', 'cutLine', 'meta'])
Run Code Online (Sandbox Code Playgroud)

资源


小智 2

简单快速的解决方案。使用 pandas 的 JSON 规范化可能存在更好的解决方案,但这对于您的用例来说相当不错。

def func(x):
    if not any(x.isnull()):
        return (x['round'], x['player']['firstName'], x['player']['identifier'], x['toParToday']['value'], x['totalScore']['value'])

df = pd.DataFrame(data['standings'])
df['round'] = data['currentRound']['name']
df = df[['player', 'toPar', 'toParToday', 'totalScore', 'round']]
info = df.apply(func, axis=1)
info_df = pd.DataFrame(list(info.values), columns=['Round', 'player_name', 'pid', 'to_par_today', 'totalScore'])
info_df.head()
Run Code Online (Sandbox Code Playgroud)