Is there an efficient way to create hierarchical JSON (n levels deep) where the parent values are the keys rather than variable labels? i.e.:
{"2017-12-31":
{"Junior":
{"Electronics":
{"A":
{"sales": 0.440755
}
},
{"B":
{"sales": -3.230951
}
}
}, ...etc...
}, ...etc...
}, ...etc...
1. My test DataFrame:
import numpy as np
import pandas as pd

colIndex = pd.MultiIndex.from_product([['New York', 'Paris'],
                                       ['Electronics', 'Household'],
                                       ['A', 'B', 'C'],
                                       ['Junior', 'Senior']],
                                      names=['City', 'Department', 'Team', 'Job Role'])
rowIndex = pd.date_range('25-12-2017', periods=12, freq='D')
df1 = pd.DataFrame(np.random.randn(12, 24), index=rowIndex, columns=colIndex)
df1.index.name = 'Date'
df2 = df1.resample('M').sum()                   # month-end sums
df3 = df2.stack(level=0).groupby('Date').sum()  # sum away the City level
2. I apply this transformation, since it seems the most logical structure from which to build the JSON:
df4=df3.stack(level=[0,1,2]).reset_index() \
.set_index(['Date','Job Role','Department','Team']) \
.sort_index()
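For reference, a quick structural check on df4 (a sketch; the numbers themselves are random):

# df4 ends up with a 4-level row index and a single value column, which
# pandas names 0 after reset_index(); renaming it keeps the JSON step readable.
df4.index.names                         # ['Date', 'Job Role', 'Department', 'Team']
df4 = df4.rename(columns={0: 'sales'})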
My attempts so far
I came across this very useful SO question, which solves one level of nesting with the following code:
j = (df.groupby(['ID','Location','Country','Latitude','Longitude'], as_index=False)
     .apply(lambda x: x[['timestamp','tide']].to_dict('records'))  # 'records' (the old 'r' shorthand is deprecated)
     .reset_index()
     .rename(columns={0:'Tide-Data'})
     .to_json(orient='records'))
...but I cannot find a way to make the nested .groupby() calls work:
j=(df.groupby('date', as_index=True).apply(
lambda x: x.groupby('Job Role', as_index=True).apply(
lambda x: x.groupby('Department', as_index=True).apply(
lambda x: x.groupby('Team', as_index=True).to_dict()))) \
.reset_index().rename(columns={0:'sales'}).to_json(orient='records'))
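A DataFrameGroupBy object has no .to_dict() method, which is why the innermost call above fails. One approach that does work is to recurse over the row index, turning each index level into one layer of dictionary keys. This is a minimal sketch, not necessarily the only way; it assumes df4 from step 2 with its value column renamed to 'sales' as shown earlier:

import json

def to_nested_dict(frame):
    # Innermost level ('Team'): {'A': {'sales': 0.44}, 'B': {'sales': -3.23}, ...}
    if frame.index.nlevels == 1:
        return {key: row.to_dict() for key, row in frame.iterrows()}
    # Outer levels: group on the first index level and recurse on the rest.
    return {key: to_nested_dict(group.droplevel(0))
            for key, group in frame.groupby(level=0)}

# Timestamps are not JSON-serializable, so stringify the Date level first.
df4.index = df4.index.set_levels(
    df4.index.levels[0].strftime('%Y-%m-%d'), level=0)
j = json.dumps(to_nested_dict(df4), indent=2)

Each recursive call strips one index level, so the output has exactly the parent-value-as-key shape sketched at the top.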
I have a DataFrame with a MultiIndex. I am wondering whether I created the DataFrame the right way (see below).
01.01 02.01 03.01 04.01
bar total1 40 52 18 11
total2 36 85 5 92
baz total1 23 39 45 70
total2 50 49 51 65
foo total1 23 97 17 97
total2 64 56 94 45
qux total1 13 73 38 4
total2 80 8 61 50
df.index.values returns:
array([('bar', 'total1'), ('bar', 'total2'), ('baz', 'total1'),
('baz', 'total2'), ('foo', 'total1'), ('foo', 'total2'),
('qux', 'total1'), ('qux', 'total2')], dtype=object)
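As a sanity check, the same index can be built explicitly with pd.MultiIndex.from_product. This is a minimal sketch; the cell values here are placeholders, not the real data:

import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([['bar', 'baz', 'foo', 'qux'],
                                  ['total1', 'total2']])
df = pd.DataFrame(np.random.randint(0, 100, size=(8, 4)),
                  index=idx, columns=['01.01', '02.01', '03.01', '04.01'])
# get_level_values is a method and needs a level argument, e.g.
# df.index.get_level_values(0); referencing it without parentheses
# just returns the bound method, as seen below.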
df.index.get_level_values returns:
<bound method MultiIndex.get_level_values of MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'total1', u'total2']], …Run Code Online (Sandbox Code Playgroud) 我想创建我的数据框,如下所示:
employeeId firstName lastName emailAddress isDependent employeeIdTypeCode entityCode sourceCode roleCode
0 E123456 Andrew Hoover hoovera@xyz.com False 001 AE AHR EMPLR
0 102939485 Andrew Hoover hoovera@xyz.com False 002 AE AHR EMPLR
2 E123458 Celeste Riddick riddickc@xyz.com True 001 AE AHR EMPLR
2 354852739 Celeste Riddick riddickc@xyz.com True 002 AE AHR EMPLR
1 E123457 Curt Austin austinc1@xyz.com True 001 AE AHR EMPLR
1 675849302 Curt Austin austinc1@xyz.com True 002 AE AHR EMPLR
3 E123459 Hazel Tooley tooleyh@xyz.com False 001 AE AHR EMPLR …

I am trying to create a nested JSON from my Spark DataFrame, which holds data with the structure below. The code below only produces a flat JSON of keys and values. Could you please help?
# mode('overwrite') is the supported way to overwrite; overwrite=True is not a JSON writer option
df.coalesce(1).write.mode('overwrite').format('json').save(data_output_file + "createjson.json")
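If the nesting should happen in Spark itself, one possible route is to aggregate twice with collect_list and struct. This is a sketch only; it assumes the column names from the sample data shown below, and the output rename is illustrative:

from pyspark.sql import functions as F

# Inner level: one array of {name, count} structs per vendor/category pair.
inner = df.groupBy('Vendor_Name', 'count', 'Categories', 'Category_Count').agg(
    F.collect_list(F.struct(F.col('Subcategory').alias('name'),
                            F.col('Subcategory_Count').alias('count'))).alias('subCategories'))

# Outer level: one array of category structs (each carrying its subCategories) per vendor.
nested = inner.groupBy('Vendor_Name', 'count').agg(
    F.collect_list(F.struct(F.col('Categories').alias('name'),
                            F.col('Category_Count').alias('count'),
                            F.col('subCategories'))).alias('categories'))

nested.withColumnRenamed('Vendor_Name', 'vendor_name') \
      .coalesce(1).write.mode('overwrite').json(data_output_file + "createjson")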
Update 1: Following @MaxU's answer, I converted the Spark DataFrame to pandas and used a groupby. It puts the last two fields into a nested array. How can I first put the category and its count into a nested array, and then put the subcategory and its count inside that array?
Sample text data:
Vendor_Name,count,Categories,Category_Count,Subcategory,Subcategory_Count
Vendor1,10,Category 1,4,Sub Category 1,1
Vendor1,10,Category 1,4,Sub Category 2,2
Vendor1,10,Category 1,4,Sub Category 3,3
Vendor1,10,Category 1,4,Sub Category 4,4
j = (data_pd.groupby(['vendor_name','vendor_Cnt','Category','Category_cnt'], as_index=False)
     .apply(lambda x: x[['Subcategory','subcategory_cnt']].to_dict('records'))  # 'records', not the deprecated 'r'
     .reset_index()
     .rename(columns={0:'subcategories'})
     .to_json(orient='records'))
[{
"vendor_name": "Vendor 1",
"count": 10,
"categories": [{
"name": "Category 1",
"count": 4,
"subCategories": [{
"name": "Sub Category 1",
"count": 1
},
{
"name": "Sub Category 2",
"count": 1
},
{
"name": "Sub Category 3",
"count": …Run Code Online (Sandbox Code Playgroud) pandas ×3