How to efficiently parse a Pandas column containing JSON?

Ale*_*rdt 8 python performance json pandas

Suppose I have the following DataFrame, where the data column contains a nested JSON string that I want to parse into separate columns:

import pandas as pd

df = pd.DataFrame({
    'bank_account': [101, 102, 201, 301],
    'data': [
        '{"uid": 100, "account_type": 1, "account_data": {"currency": {"current": 1000, "minimum": -500}, "fees": {"monthly": 13.5}}, "user_name": "Alice"}',
        '{"uid": 100, "account_type": 2, "account_data": {"currency": {"current": 2000, "minimum": 0},  "fees": {"monthly": 0}}, "user_name": "Alice"}',
        '{"uid": 200, "account_type": 1, "account_data": {"currency": {"current": 3000, "minimum": 0},  "fees": {"monthly": 13.5}}, "user_name": "Bob"}',        
        '{"uid": 300, "account_type": 1, "account_data": {"currency": {"current": 4000, "minimum": 0},  "fees": {"monthly": 13.5}}, "user_name": "Carol"}'        
    ]},
    index = ['Alice', 'Alice', 'Bob', 'Carol']
)


df

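I found the json_normalize function and am currently parsing the JSON in a list comprehension. The result is correct, but it takes a long time: 1000 rows take 1-2 seconds, and my real data has about a million rows: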

import json
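# pandas >= 1.0 exposes json_normalize at the top level;
# on older versions use: from pandas.io.json import json_normalize
from pandas import json_normalize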

parsed_df = pd.concat([json_normalize(json.loads(js)) for js in df['data']])

parsed_df['bank_account'] = df['bank_account'].values
parsed_df.index = parsed_df['uid']  # the JSON key is "uid"; there is no "user_id" column

parsed_df

Is there a faster way to parse this data into a tidy DataFrame?

jpp*_*jpp 3

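I see a modest speed-up from bypassing pandas.concat; the timings at the end of the code below show roughly 526 ms versus 675 ms per loop.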

Beyond that, rewriting or optimizing json_normalize itself does not seem straightforward.
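To give a sense of what a hand-written replacement would involve, here is a minimal sketch of a recursive flattener. It is not part of the timings in this answer: the `flatten` and `parse_with_flatten` helpers are hypothetical names, the code only handles nested dicts (not lists of records), and it assumes every row has a `uid` key. It is an illustration under those assumptions rather than a drop-in replacement for json_normalize.

```python
import json

import pandas as pd


def flatten(obj, prefix=""):
    """Flatten nested dicts into one dict with dot-separated keys (hypothetical helper)."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying the key path along.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def parse_with_flatten(df):
    """Parse df['data'] row by row, then build the DataFrame in a single step."""
    records = [flatten(json.loads(js)) for js in df["data"]]
    parsed = pd.DataFrame.from_records(records)
    parsed["bank_account"] = df["bank_account"].values
    # Keep 'uid' as a column and also use it as the index, as in the question.
    return parsed.set_index("uid", drop=False)
```

Whether this is faster than json_normalize depends on the data and the pandas version, so it is worth measuring with %timeit before relying on it.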

def original(df):
    parsed_df = pd.concat([json_normalize(json.loads(js)) for js in df['data']])

    parsed_df['bank_account'] = df['bank_account'].values
    parsed_df.index = parsed_df['uid']

    return parsed_df

def jp(df):

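    # Take the column names from the first row instead of hardcoding them:
    # the column order json_normalize produces can differ between pandas versions.
    # This assumes every row has the same JSON structure.
    cols = json_normalize(json.loads(df['data'].iloc[0])).columns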

    parsed_df = pd.DataFrame([json_normalize(json.loads(js)).values[0] for js in df['data']],
                             columns=cols)

    parsed_df['bank_account'] = df['bank_account'].values
    parsed_df.index = parsed_df['uid']

    return parsed_df

df = pd.concat([df]*100, ignore_index=True)

%timeit original(df)  # 675 ms per loop
%timeit jp(df)        # 526 ms per loop
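Another variant worth timing, which is not part of the measurements above: decode every string with json.loads first, then call json_normalize once on the whole list instead of once per row. The sketch below assumes pandas >= 1.0 (for the top-level `pd.json_normalize`) and that every row has a `uid` key; `parse_batch` is a hypothetical name.

```python
import json

import pandas as pd


def parse_batch(df):
    """Decode all JSON strings, then normalize them in one json_normalize call."""
    # One json.loads per row is unavoidable, but the DataFrame is built only once.
    records = [json.loads(js) for js in df["data"]]
    parsed = pd.json_normalize(records)

    parsed["bank_account"] = df["bank_account"].values
    parsed.index = parsed["uid"]
    return parsed
```

It can be compared against the two functions above with `%timeit parse_batch(df)`. No timing is claimed here, since the result depends on the pandas version and the shape of the data.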