Yvi*_*ihs 6 python json python-3.x pandas
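I have a dataset like the one shown below. It is relational, but it has one dimension named event_params, which is a JSON object holding data related to the event_name in the same row.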
import pandas as pd

a_df = pd.DataFrame(data={
    'date_time': ['2021-01-03 15:12:42', '2021-01-03 15:12:46', '2021-01-03 15:13:01',
                  '2021-01-03 15:13:12', '2021-01-03 15:13:13', '2021-01-03 15:13:15',
                  '2021-01-04 03:29:01', '2021-01-04 18:15:14', '2021-01-04 18:16:01'],
    'user_id': ['dhj13h', 'dhj13h', 'dhj13h', 'dhj13h', 'dhj13h', 'dhj13h', '38nr10', '38nr10', '38nr10'],
    'account_id': ['181d9k', '181d9k', '181d9k', '181d9k', '181d9k', '181d9k', '56sf15', '56sf15', '56sf15'],
    'event_name': ['button_click', 'screen_view', 'close_view', 'button_click', 'exit_app', 'uninstall_app',
                   'install_app', 'exit_app', 'uninstall_app'],
    'event_params': ["{'button_id': 'shop_screen', 'button_container': 'main_screen', 'button_label_text': 'Enter Shop'}",
                     "{'screen_id': 'shop_main_page', 'screen_controller': 'main_view_controller', 'screen_title': 'Main Menu'}",
                     "{'screen_id': 'shop_main_page'}",
                     "{'button_id': 'back_to_main_menu', 'button_container': 'shop_screen', 'button_label_text': 'Exit Shop'}",
                     "{}",
                     "{}",
                     "{'utm_campaign': 'null', 'utm_source': 'null'}",
                     "{}",
                     "{}"]
})
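I'm looking for ways to work with this kind of data. My initial approach uses pandas, but I'm open to other approaches.
My ideal end state is to be able to examine every relationship for every user. In its current form, I have to compare the dicts/JSON blobs in event_params to work out the context behind an event.
I tried using explode() to expand the event_params column. My thinking is that the best approach is to convert event_params into a relational format, where each parameter becomes an extra row of the DataFrame next to its value (in other words, while keeping the date_time, user_id and event_name it was originally associated with).
My explode approach did not work well: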
a_df['event_params'] = a_df['event_params'].apply(eval)
exploded_df = a_df.explode('event_params')
The output is:
date_time, user_id, account_id, event_name, event_params
2021-01-03 15:12:42,dhj13h,181d9k,button_click,button_id
2021-01-03 15:12:42,dhj13h,181d9k,button_click,button_container
It does work, but it strips out the value fields. Ideally, I'd like to keep those values as well.
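A likely reason the values disappear, consistent with the output above: iterating over a dict yields only its keys, so exploding a column of dicts produces one row per key. The sketch below uses plain Python on a shortened copy of the first event_params string (the variable names are illustrative only) to contrast iterating the dict with iterating its .items():

```python
from ast import literal_eval

# Shortened copy of the first event_params value from the question.
raw = "{'button_id': 'shop_screen', 'button_container': 'main_screen'}"
params = literal_eval(raw)

# Iterating a dict directly yields its keys only; the values are dropped.
keys_only = list(params)

# Iterating .items() yields (key, value) pairs, so nothing is lost.
pairs = list(params.items())

print(keys_only)
print(pairs)
```

Turning each dict into a list of pairs (or small dicts) before exploding is therefore one way to keep the values.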
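I hope I've understood your question correctly. You can convert each value in the event_params column from a dictionary into a list of dictionaries, explode the column, and then expand the result into new key / value columns: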
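from ast import literal_eval

# Assumes event_params still holds the original strings (not yet eval'd).
# Turn each params string into a list of {"key": ..., "value": ...} dicts,
# then explode so that every parameter gets its own row.
a_df = a_df.assign(
    event_params=a_df["event_params"].apply(
        lambda x: [{"key": k, "value": v} for k, v in literal_eval(x).items()]
    )
).explode("event_params")

# Expand the per-row dicts into "key" and "value" columns. Rows that had no
# parameters produce a stray column named 0, which is dropped.
a_df = pd.concat(
    [a_df, a_df.pop("event_params").apply(pd.Series)],
    axis=1,
).drop(columns=0)

print(a_df)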
Prints:
date_time user_id account_id event_name key value
0 2021-01-03 15:12:42 dhj13h 181d9k button_click button_id shop_screen
0 2021-01-03 15:12:42 dhj13h 181d9k button_click button_container main_screen
0 2021-01-03 15:12:42 dhj13h 181d9k button_click button_label_text Enter Shop
1 2021-01-03 15:12:46 dhj13h 181d9k screen_view screen_id shop_main_page
1 2021-01-03 15:12:46 dhj13h 181d9k screen_view screen_controller main_view_controller
1 2021-01-03 15:12:46 dhj13h 181d9k screen_view screen_title Main Menu
2 2021-01-03 15:13:01 dhj13h 181d9k close_view screen_id shop_main_page
3 2021-01-03 15:13:12 dhj13h 181d9k button_click button_id back_to_main_menu
3 2021-01-03 15:13:12 dhj13h 181d9k button_click button_container shop_screen
3 2021-01-03 15:13:12 dhj13h 181d9k button_click button_label_text Exit Shop
4 2021-01-03 15:13:13 dhj13h 181d9k exit_app NaN NaN
5 2021-01-03 15:13:15 dhj13h 181d9k uninstall_app NaN NaN
6 2021-01-04 03:29:01 38nr10 56sf15 install_app utm_campaign null
6 2021-01-04 03:29:01 38nr10 56sf15 install_app utm_source null
7 2021-01-04 18:15:14 38nr10 56sf15 exit_app NaN NaN
8 2021-01-04 18:16:01 38nr10 56sf15 uninstall_app NaN NaN
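For comparison, the same long key / value layout can also be built with an ordinary Python loop instead of explode() and apply(pd.Series). This is only a sketch of an alternative, not the method used above: the helper to_long and the two-row events frame are hypothetical stand-ins, and it assumes event_params still contains the original strings. Unlike the output above, the result gets a fresh 0..n-1 index rather than repeating the original row labels.

```python
from ast import literal_eval

import pandas as pd

# Small stand-in for a_df: two rows taken from the question's data.
events = pd.DataFrame({
    "date_time": ["2021-01-03 15:13:01", "2021-01-03 15:13:13"],
    "user_id": ["dhj13h", "dhj13h"],
    "account_id": ["181d9k", "181d9k"],
    "event_name": ["close_view", "exit_app"],
    "event_params": ["{'screen_id': 'shop_main_page'}", "{}"],
})


def to_long(df, column="event_params"):
    """Return one row per (event, parameter) pair.

    Hypothetical helper: events without parameters are kept as a single
    row whose key and value are NaN.
    """
    other_cols = [c for c in df.columns if c != column]
    rows = []
    for record in df.to_dict("records"):
        params = literal_eval(record.pop(column))
        # Fall back to one NaN pair so parameterless events are not lost.
        items = list(params.items()) or [(float("nan"), float("nan"))]
        for key, value in items:
            rows.append({**record, "key": key, "value": value})
    return pd.DataFrame(rows, columns=other_cols + ["key", "value"])


long_df = to_long(events)
print(long_df)
```

From this shape, a filter such as long_df[long_df["key"] == "screen_id"] selects every event that carries a given parameter.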