Sla*_*vka 6 python pandas pandas-groupby
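I have a dataframe with these columns:
diff - the difference between the registration date and the payment date, in days
country - the user's country
user_id
campaign_id -- another categorical column, which we will use in the groupby
I need to count the distinct users with diff <= n for each country + campaign_id group. For example, for country 'A', campaign 'abc' and diff 7, I need the number of distinct users among the rows with country 'A', campaign 'abc' and diff <= 7.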
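To make the requirement concrete, here is a minimal sketch on a made-up four-row frame. The data and the helper name `distinct_users` are hypothetical and serve only to illustrate what a single output value should be; the column is called `campaign` to match the code in the question.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    'country':  ['A', 'A', 'A', 'A'],
    'campaign': ['abc', 'abc', 'abc', 'abc'],
    'diff':     [1, 7, 7, 9],
    'user_id':  [10, 10, 11, 12],
})

def distinct_users(df, country, campaign, n):
    """Count distinct user_id values in one group among rows with diff <= n."""
    mask = (
        (df['country'] == country)
        & (df['campaign'] == campaign)
        & (df['diff'] <= n)
    )
    return df.loc[mask, 'user_id'].nunique()

# Rows with diff <= 7 belong to users 10, 10 and 11, so the answer is 2
print(distinct_users(toy, 'A', 'abc', 7))  # 2
```

The full task is to produce this number for every combination of country, campaign and observed diff value.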
My current solution (below) takes too long to run:
import pandas as pd
import numpy as np

## generate test dataframe
df = pd.DataFrame({
    'country': np.random.choice(['A', 'B', 'C', 'D'], 10000),
    'campaign': np.random.choice(['camp1', 'camp2', 'camp3', 'camp4', 'camp5', 'camp6'], 10000),
    'diff': np.random.choice(range(10), 10000),
    'user_id': np.random.choice(range(1000), 10000)
})

## main
result_df = pd.DataFrame()
for diff in df['diff'].unique():
    tmp_df = df.loc[df['diff']<=diff,:]
    tmp_df = tmp_df.groupby(['country', 'campaign'], as_index=False).apply(lambda x: x.user_id.nunique()).reset_index()
    tmp_df['diff'] = diff
    tmp_df.columns=['country', 'campaign', 'unique_ppl', 'diff']
    result_df = pd.concat([result_df, tmp_df],ignore_index=True, axis=0)
Maybe there is a better way to do this?
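First use a list comprehension with concat and assign to stack all the filtered slices together, then groupby with nunique, adding the diff column to the grouping keys, and finally rename the column and, if necessary, add reindex for a custom column order: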
df1 = pd.concat([df.loc[df['diff']<=x].assign(diff=x) for x in df['diff'].unique()])
df2 = (df1.groupby(['diff','country', 'campaign'], sort=False)['user_id']
          .nunique()
          .reset_index()
          .rename(columns={'user_id':'unique_ppl'})
          .reindex(columns=['country', 'campaign', 'unique_ppl', 'diff']))
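The stacking step is easier to follow on a tiny frame. The sketch below uses made-up data (three rows, one group) to show what the concatenated frame looks like before the groupby; the names `toy`, `stacked` and `counts` are illustrative and not part of the answer.

```python
import pandas as pd

# Hypothetical toy data: one group, two users, two diff values
toy = pd.DataFrame({
    'country':  ['A', 'A', 'A'],
    'campaign': ['camp1', 'camp1', 'camp1'],
    'diff':     [0, 1, 1],
    'user_id':  [1, 1, 2],
})

# For each threshold x, take the rows with diff <= x and relabel them with diff = x
stacked = pd.concat(
    [toy.loc[toy['diff'] <= x].assign(diff=x) for x in toy['diff'].unique()]
)
print(len(stacked))  # 1 row for x = 0 plus 3 rows for x = 1 -> 4

# After relabelling, a plain groupby on diff counts users cumulatively:
# 1 distinct user at diff <= 0, 2 distinct users at diff <= 1
counts = stacked.groupby(['diff', 'country', 'campaign'])['user_id'].nunique()
print(counts)
```

One consequence of this design is that the stacked frame holds one copy of each row for every threshold at or above that row's diff, so with k distinct diff values it can grow to as much as k times the original size.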
Below is an alternative, but @jezrael's solution is the fastest of the three.
Performance benchmarking
%timeit original(df) # 149ms
%timeit jp(df) # 81ms
%timeit jez(df) # 47ms
def original(df):
    result_df = pd.DataFrame()
    for diff in df['diff'].unique():
        tmp_df = df.loc[df['diff']<=diff,:]
        tmp_df = tmp_df.groupby(['country', 'campaign'], as_index=False).apply(lambda x: x.user_id.nunique()).reset_index()
        tmp_df['diff'] = diff
        tmp_df.columns=['country', 'campaign', 'unique_ppl', 'diff']
        result_df = pd.concat([result_df, tmp_df],ignore_index=True, axis=0)
    return result_df

def jp(df):
    lst = []
    lst_append = lst.append
    for diff in df['diff'].unique():
        tmp_df = df.loc[df['diff']<=diff,:]
        tmp_df = tmp_df.groupby(['country', 'campaign'], as_index=False).agg({'user_id': 'nunique'})
        tmp_df['diff'] = diff
        tmp_df.columns=['country', 'campaign', 'unique_ppl', 'diff']
        lst_append(tmp_df)
    # DataFrame.append was removed in pandas 2.0; concatenating the list once gives the same frame
    return pd.concat(lst, ignore_index=True, axis=0)

def jez(df):
    df1 = pd.concat([df.loc[df['diff']<=x].assign(diff=x) for x in df['diff'].unique()])
    df2 = (df1.groupby(['diff','country', 'campaign'], sort=False)['user_id']
              .nunique()
              .reset_index()
              .rename(columns={'user_id':'unique_ppl'})
              .reindex(columns=['country', 'campaign', 'unique_ppl', 'diff']))
    return df2
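All three functions above filter the frame once per diff value. A different technique, not taken from either answer and not benchmarked here, is to record the smallest diff at which each user appears in a group and then take a running total of those first appearances. The sketch below assumes the same column names as the question; the function name `cumulative_unique` is hypothetical.

```python
import numpy as np
import pandas as pd

def cumulative_unique(df):
    """Distinct users with diff <= n per (country, campaign), via a running total."""
    # Smallest diff at which each user shows up in each (country, campaign) group
    first = df.groupby(['country', 'campaign', 'user_id'], as_index=False)['diff'].min()
    # Number of users seen for the first time at each diff value, one column per diff
    new_users = (first.groupby(['country', 'campaign', 'diff']).size()
                      .unstack('diff', fill_value=0))
    # Ensure every observed diff value has a column, in ascending order
    new_users = new_users.reindex(columns=np.sort(df['diff'].unique()), fill_value=0)
    # Running total along the diff axis = distinct users with diff <= n
    running = new_users.cumsum(axis=1)
    out = (running.melt(ignore_index=False, var_name='diff', value_name='unique_ppl')
                  .reset_index())
    # Drop thresholds below a group's smallest diff, which no row satisfies
    return out[out['unique_ppl'] > 0].reset_index(drop=True)

# Hypothetical toy data for a quick check
toy = pd.DataFrame({
    'country':  ['A', 'A', 'A', 'B'],
    'campaign': ['x', 'x', 'x', 'x'],
    'diff':     [0, 2, 2, 1],
    'user_id':  [1, 1, 2, 3],
})
print(cumulative_unique(toy))
```

This avoids building a stacked copy of the data for every threshold, at the cost of a wide intermediate table with one column per distinct diff value. Whether it is actually faster than `jez` on a given dataset would need to be measured.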