Loading nested JSON data into a pandas DataFrame

sim*_*ham 5 python json dataframe pandas

I have now added my current problem to GitHub. The repo URL is below. I have included a Jupyter notebook that also explains the problem. Thanks, everyone.

https://github.com/simongraham/dataExplore.git


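I am currently working with nutrition data for a project. The data is in raw JSON format, and I want to use Python and pandas to get it into a readable DataFrame. I know that this is a simple task when the JSON is not nested. In that case I would use: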

nutrition = pd.read_json('data')

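However, my data is nested, and I am finding it hard to get it into a sensible DataFrame. The JSON format is shown below, where the nutritionNutrients element is itself nested. This nested element describes the nutritional content of various things, such as alcohol and bcfa, as included below. I have only included a sample, as it is a large data file.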

  [
        {
            "vcNutritionPortionId": "478d1905-f264-4d9b-ab76-0ed4252193fd",
            "vcNutritionId": "2476378b-79ee-4857-a81d-489661a039a1",
            "vcUserId": "cc51145b-5a70-4344-9b55-1a4455f0a9d2",
            "vcPortionId": "1",
            "vcPortionName": "1 average pepper",
            "vcPortionSize": "20",
            "ftEnergyKcal": 5.2,
            "vcPortionUnit": "g",
            "dtConsumedDate": "2016-05-04T00:00:00",
            "nutritionNutrients": [
                {
                    "vcNutritionPortionId": "478d1905-f264-4d9b-ab76-0ed4252193fd",
                    "vcNutrient": "alcohol",
                    "ftValue": 0,
                    "vcUnit": "g",
                    "nPercentRI": 0,
                    "vcTrafficLight": ""
                },
                {
                    "vcNutritionPortionId": "478d1905-f264-4d9b-ab76-0ed4252193fd",
                    "vcNutrient": "bcfa",
                    "ftValue": 0,
                    "vcUnit": "g",
                    "nPercentRI": 0,
                    "vcTrafficLight": ""
                },
                {
                    "vcNutritionPortionId": "478d1905-f264-4d9b-ab76-0ed4252193fd",
                    "vcNutrient": "biotin",
                    "ftValue": 0,
                    "vcUnit": "µg",
                    "nPercentRI": 0,
                    "vcTrafficLight": ""
                },
                ...
            ]
        }
    ]
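
To make the difficulty concrete, here is a minimal sketch of what a direct load produces. The `records` list is a trimmed-down, hand-written copy of the sample above (only a few fields kept), and `flat_attempt` is a name chosen for this illustration.

```python
import pandas as pd

# Trimmed-down copy of the sample record (only a few fields kept).
records = [
    {
        "vcPortionName": "1 average pepper",
        "ftEnergyKcal": 5.2,
        "nutritionNutrients": [
            {"vcNutrient": "alcohol", "ftValue": 0, "vcUnit": "g"},
            {"vcNutrient": "bcfa", "ftValue": 0, "vcUnit": "g"},
        ],
    }
]

# Building the frame directly gives one row per portion; the nested
# list is kept as a plain Python list inside a single cell.
flat_attempt = pd.DataFrame(records)
print(flat_attempt.shape)
print(type(flat_attempt.loc[0, "nutritionNutrients"]))
```

The nutrient entries never become rows or columns of their own, which is why a plain load is not enough here.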

Any help would be appreciated.

Thanks.

.... ....

Now that I have found how to solve this using json_normalize, I am back to the same problem, but this time my data is nested twice, i.e.:

[
{
...,
"nutritionPortions": [
    {
        "vcNutritionPortionId": "478d1905-f264-4d9b-ab76-0ed4252193fd",
        "vcNutritionId": "2476378b-79ee-4857-a81d-489661a039a1",
        "vcUserId": "cc51145b-5a70-4344-9b55-1a4455f0a9d2",
        "vcPortionId": "1",
        "vcPortionName": "1 average pepper",
        "vcPortionSize": "20",
        "ftEnergyKcal": 5.2,
        "vcPortionUnit": "g",
        "dtConsumedDate": "2016-05-04T00:00:00",
        "nutritionNutrients": [
            {
                "vcNutritionPortionId": "478d1905-f264-4d9b-ab76-0ed4252193fd",
                "vcNutrient": "alcohol",
                "ftValue": 0,
                "vcUnit": "g",
                "nPercentRI": 0,
                "vcTrafficLight": ""
            },
            {
                "vcNutritionPortionId": "478d1905-f264-4d9b-ab76-0ed4252193fd",
                "vcNutrient": "bcfa",
                "ftValue": 0,
                "vcUnit": "g",
                "nPercentRI": 0,
                "vcTrafficLight": ""
            },
            {
                "vcNutritionPortionId": "478d1905-f264-4d9b-ab76-0ed4252193fd",
                "vcNutrient": "biotin",
                "ftValue": 0,
                "vcUnit": "µg",
                "nPercentRI": 0,
                "vcTrafficLight": ""
            },
            ...
        ]
    }
]
}
]

When I have a JSON that contains only the nutrition data, I can use:

nutrition = (pd.io
   .json
   .json_normalize(data, 'nutritionNutrients',
        ['vcNutritionId','vcUserId','vcPortionId','vcPortionName','vcPortionSize',
         'ftEnergyKcal','vcPortionUnit','dtConsumedDate'])
)

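However, my data does not only contain nutrition information. For example, it also contains activity information, so the nutrition information is nested under "nutritionPortions" at the top level. Let us assume that all other columns are not nested and are represented by "Activity" and "Wellbeing".

If I use the code: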

nutrition = (pd.io
   .json
   .json_normalize(data, ['nutritionPortions'])
)

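I am back at the original problem, where "nutritionNutrients" is nested, but I have not managed to get the corresponding DataFrame from there.

Thanks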

Max*_*axU 4

UPDATE: this should work for your kaidoData.json file:

df = (pd.io
        .json
        .json_normalize(data[0]['nutritionPortions'], 'nutritionNutrients',
            ['vcNutritionId','vcUserId','vcPortionId','vcPortionName','vcPortionSize',
             'dtCreatedDate','dtUpdatedDate','nProcessingStatus',
             'vcPortionUnit','dtConsumedDate'
            ]
        )
)
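
Note that `data[0]` only covers the first top-level record. If the file holds several such records, one possible approach is to gather the portions of all of them first and normalize the combined list. The sketch below assumes pandas 1.0 or later (where the function is also exposed as `pd.json_normalize`); the two-record `data` list and its "Activity" values are made up for illustration and are not taken from kaidoData.json.

```python
import pandas as pd

# Hypothetical input: "Activity" stands in for the non-nested columns.
data = [
    {
        "Activity": "walk",
        "nutritionPortions": [
            {
                "vcPortionName": "1 average pepper",
                "vcPortionSize": "20",
                "nutritionNutrients": [
                    {"vcNutrient": "alcohol", "ftValue": 0},
                    {"vcNutrient": "bcfa", "ftValue": 0},
                ],
            }
        ],
    },
    {
        "Activity": "run",
        "nutritionPortions": [
            {
                "vcPortionName": "1 apple",
                "vcPortionSize": "100",
                "nutritionNutrients": [{"vcNutrient": "biotin", "ftValue": 0}],
            }
        ],
    },
]

# Collect the portions of every top-level record into one flat list;
# records without a "nutritionPortions" key contribute nothing.
portions = [p for rec in data for p in rec.get("nutritionPortions", [])]

# One row per nutrient, with the chosen portion-level fields repeated.
df_all = pd.json_normalize(
    portions,
    record_path="nutritionNutrients",
    meta=["vcPortionName", "vcPortionSize"],
)
print(df_all)
```

The meta list deliberately leaves out any key that also appears inside the nutrient entries (such as vcNutritionPortionId in the real data), since a name clash between record and meta fields makes json_normalize raise an error.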

PS I don't know what's wrong with 'ftEnergyKcal'. It raises:

KeyError: 'ftEnergyKcal'

Maybe it is missing in some sections
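
If the key really is absent from some portions, one option is the `errors='ignore'` argument of json_normalize, which fills a missing meta key with NaN instead of raising. This is a minimal sketch on made-up data, assuming pandas 1.0 or later; the second portion, which lacks "ftEnergyKcal", is hypothetical and only there to reproduce the situation.

```python
import pandas as pd

portions = [
    {
        "vcPortionName": "1 average pepper",
        "ftEnergyKcal": 5.2,
        "nutritionNutrients": [{"vcNutrient": "alcohol", "ftValue": 0}],
    },
    {
        # Hypothetical portion that lacks "ftEnergyKcal".
        "vcPortionName": "1 apple",
        "nutritionNutrients": [{"vcNutrient": "biotin", "ftValue": 0}],
    },
]

# List the portions that miss the key, to confirm the suspicion.
missing = [p["vcPortionName"] for p in portions if "ftEnergyKcal" not in p]
print(missing)

# errors="ignore" puts NaN where a meta key is absent instead of raising KeyError.
df_tolerant = pd.json_normalize(
    portions,
    record_path="nutritionNutrients",
    meta=["vcPortionName", "ftEnergyKcal"],
    errors="ignore",
)
print(df_tolerant)
```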

OLD answer:

Use json_normalize():

(pd.io
   .json
   .json_normalize(l, 'nutritionNutrients',
        ['vcNutritionId','vcUserId','vcPortionId','vcPortionName','vcPortionSize',
         'ftEnergyKcal','vcPortionUnit','dtConsumedDate'])
)

Demo:

In [107]: (pd.io
   .....:    .json
   .....:    .json_normalize(l, 'nutritionNutrients',
   .....:         ['vcNutritionId','vcUserId','vcPortionId','vcPortionName','vcPortionSize',
   .....:          'ftEnergyKcal','vcPortionUnit','dtConsumedDate'])
   .....: )
Out[107]:
   ftValue  nPercentRI vcNutrient vcNutritionPortionId vcTrafficLight        ...        vcPortionSize  \
0        0           0    alcohol  478d1905-f264-4d...                       ...                   20
1        0           0       bcfa  478d1905-f264-4d...                       ...                   20
2        0           0     biotin  478d1905-f264-4d...                       ...                   20

         vcNutritionId vcPortionId ftEnergyKcal     vcPortionName
0  2476378b-79ee-48...           1          5.2  1 average pepper
1  2476378b-79ee-48...           1          5.2  1 average pepper
2  2476378b-79ee-48...           1          5.2  1 average pepper

[3 rows x 14 columns]

where l is your list (the parsed JSON)
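
For completeness, a short sketch of how such a list `l` might be obtained from JSON text before normalizing it. The `raw` string is a shortened, hand-written stand-in for the real file, and the call assumes pandas 1.0 or later, where the function is also available as `pd.json_normalize`.

```python
import json

import pandas as pd

# Shortened stand-in for the file contents; for a file on disk,
# json.load(open_file_object) would be used instead of json.loads.
raw = """
[
  {
    "vcPortionName": "1 average pepper",
    "ftEnergyKcal": 5.2,
    "nutritionNutrients": [
      {"vcNutrient": "alcohol", "ftValue": 0, "vcUnit": "g"}
    ]
  }
]
"""

l = json.loads(raw)  # a list of dicts, one per portion

df = pd.json_normalize(l, "nutritionNutrients", ["vcPortionName", "ftEnergyKcal"])
print(df)
```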