在Pandas数据帧中提取嵌入为字符串的嵌套JSON

SoH*_*Hei 9 python csv json

我有一个CSV,其中一个字段是嵌套的JSON对象,存储为字符串.我想将CSV加载到数据帧中,并将JSON解析为附加到原始数据帧的一组字段; 换句话说,提取JSON的内容并使它们成为数据帧的一部分.

我的CSV:

id|dist|json_request
1|67|{"loc":{"lat":45.7, "lon":38.9},"arrival": "Monday", "characteristics":{"body":{"color":"red", "make":"sedan"}, "manuf_year":2014}}
2|34|{"loc":{"lat":46.89, "lon":36.7},"arrival": "Tuesday", "characteristics":{"body":{"color":"blue", "make":"sedan"}, "manuf_year":2014}}
3|98|{"loc":{"lat":45.70, "lon":31.0}, "characteristics":{"body":{"color":"yellow"}, "manuf_year":2010}}
Run Code Online (Sandbox Code Playgroud)

请注意,并非所有行的所有键都相同.我希望它能产生一个与此相当的数据框:

data = {'id'     : [1, 2, 3],
        'dist'  : [67, 34, 98],
        'loc_lat': [45.7, 46.89, 45.70],
        'loc_lon': [38.9, 36.7, 31.0],
        'arrival': ["Monday", "Tuesday", "NA"],
        'characteristics_body_color':["red", "blue", "yellow"],
        'characteristics_body_make':["sedan", "sedan", "NA"],
        'characteristics_manuf_year':[2014, 2014, 2010]}
df = pd.DataFrame(data)
Run Code Online (Sandbox Code Playgroud)

(我很抱歉,我不能让桌子本身看起来很明智!请不要生我的气,我是菜鸟:()

我试过的

在经历了很多困难之后,我提出了以下解决方案:

#Import data
df_raw = pd.read_csv("sample.csv", delimiter="|")

#Parsing function
def parse_request(s):
    sj = json.loads(s)
    norm = json_normalize(sj)
    return norm

#Create an empty dataframe to store results
parsed = pd.DataFrame(columns=['id'])

#Loop through and parse JSON in each row
for i in df_raw.json_request:
    parsed = parsed.append(parse_request(i))

#Merge results back onto original dataframe
df_parsed = df_raw.join(parsed)
Run Code Online (Sandbox Code Playgroud)

这显然是不优雅的,效率非常低(在我需要解析的300K行上需要多个小时).有没有更好的办法?

我看的地方

我已经完成了以下相关问题: 将CSV读入pandas,其中一列是json字符串 (似乎只适用于简单的非嵌套JSON)

JSON到pandas DataFrame (我从中借用了部分解决方案,但我无法弄清楚如何在数据帧中应用此解决方案而不循环遍历行)

我使用的是Python 3.3和Pandas 0.17.

jme*_*jme 10

这是一种方法,可以将速度提高10到100倍,并且应该允许您在一分钟内读取大文件,而不是一个多小时.我们的想法是,只有在读取了所有数据后才构造数据帧,从而减少了需要分配内存的次数,并且只json_normalize对整个数据块调用一次,而不是每行调用一次:

import csv
import json

import pandas as pd
from pandas.io.json import json_normalize

with open('sample.csv') as fh:
    rows = csv.reader(fh, delimiter='|')
    header = next(rows)

    # "transpose" the data. `data` is now a tuple of strings
    # containing JSON, one for each row
    idents, dists, data = zip(*rows)

data = [json.loads(row) for row in data]
df = json_normalize(data)
df['ids'] = idents
df['dists'] = dists
Run Code Online (Sandbox Code Playgroud)

以便:

>>> print(df)

   arrival characteristics.body.color characteristics.body.make  \
0   Monday                        red                     sedan   
1  Tuesday                       blue                     sedan   
2      NaN                     yellow                       NaN   

   characteristics.manuf_year  loc.lat  loc.lon ids  
0                        2014    45.70     38.9   1  
1                        2014    46.89     36.7   2  
2                        2010    45.70     31.0   3
Run Code Online (Sandbox Code Playgroud)

此外,我看着什么pandasjson_normalize是干什么的,它执行一些深层次的副本,如果你只是建立从CSV一个数据帧不应该是必要的.我们可以实现我们自己的flatten函数,它接受字典并"平坦化"键,类似于json_normalize它.然后我们可以创建一个生成器,它一次吐出一行数据帧作为记录.这种方法更快:

def flatten(dct, separator='_'):
    """A fast way to flatten a dictionary,"""
    res = {}
    queue = [('', dct)]

    while queue:
        prefix, d = queue.pop()
        for k, v in d.items():
            key = prefix + k
            if not isinstance(v, dict):
                res[key] = v
            else:
                queue.append((key + separator, v))

    return res

def records_from_json(fh):
    """Yields the records from a file object."""
    rows = csv.reader(fh, delimiter='|')
    header = next(rows)
    for ident, dist, data in rows:
        rec = flatten(json.loads(data))
        rec['id'] = ident
        rec['dist'] = dist
        yield rec

def from_records(path):
    with open(path) as fh:
        return pd.DataFrame.from_records(records_from_json(fh))
Run Code Online (Sandbox Code Playgroud)

以下是计时实验的结果,我通过重复行人为地增加了样本数据的大小.行数表示为n_rows:

        method 1 (s)  method 2 (s)  original time (s)
n_rows                                               
96          0.008217      0.002971           0.362257
192         0.014484      0.004720           0.678590
384         0.027308      0.008720           1.373918
768         0.055644      0.016175           2.791400
1536        0.105730      0.030914           5.727828
3072        0.209049      0.060105          11.877403
Run Code Online (Sandbox Code Playgroud)

线性外推,第一种方法应该在大约20秒内读取300k行,而第二种方法应该花费大约6秒.