Pandas: conditional shift

ric*_*olo 15 python datetime data-analysis pandas

Is there a way to shift a dataframe column depending on a condition on two other columns? Something like:

df["cumulated_closed_value"] = df.groupby("user")['close_cumsum'].shiftWhile(df['close_time'] > df['open_time'])

I have figured out a way to do this, but it is inefficient:

1) Load the data and create the column to shift

import pandas as pd

df=pd.read_csv('data.csv')
df.sort_values(['user','close_time'],inplace=True)
df['close_cumsum']=df.groupby('user')['value'].cumsum()
df.sort_values(['user','open_time'],inplace=True)
print(df)

Output:

   user  open_time close_time  value  close_cumsum
0     1 2017-01-01 2017-03-01      5            18
1     1 2017-01-02 2017-02-01      6             6
2     1 2017-02-03 2017-02-05      7            13
3     1 2017-02-07 2017-04-01      3            21
4     1 2017-09-07 2017-09-11      1            22
5     2 2018-01-01 2018-02-01     15            15
6     2 2018-03-01 2018-04-01      3            18

2) Shift the column using a self-join and some filters

Self-join (this is memory inefficient):

df2=pd.merge(df[['user','open_time']],df[['user','close_time','close_cumsum']], on='user')

Filter for 'close_time' < 'open_time', then keep the row with the max close_time:

df2=df2[df2['close_time']<df2['open_time']]
idx = df2.groupby(['user','open_time'])['close_time'].transform(max) == df2['close_time']
df2=df2[idx]

3) Merge with the original dataset:

df3=pd.merge(df[['user','open_time','close_time','value']],df2[['user','open_time','close_cumsum']],how='left')
print(df3)

Output:

   user  open_time close_time  value  close_cumsum
0     1 2017-01-01 2017-03-01      5           NaN
1     1 2017-01-02 2017-02-01      6           NaN
2     1 2017-02-03 2017-02-05      7           6.0
3     1 2017-02-07 2017-04-01      3          13.0
4     1 2017-09-07 2017-09-11      1          21.0
5     2 2018-01-01 2018-02-01     15           NaN
6     2 2018-03-01 2018-04-01      3          15.0

Is there a more pandas-like way to get the same result?

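Edit: I have added one line of data to make the case clearer. My goal is to get the sum of all the transactions that were closed before the opening time of the new transaction.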
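
To pin down the goal stated in the edit, here is a slow but straightforward reference sketch. It rebuilds the sample data from the printed output above; the helper name closed_value_before_open is made up for illustration, and it returns 0 where the question's output shows NaN.

```python
import pandas as pd

# Sample data copied from the question's printed output.
df = pd.DataFrame({
    "user": [1, 1, 1, 1, 1, 2, 2],
    "open_time": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-02-03",
                                 "2017-02-07", "2017-09-07", "2018-01-01",
                                 "2018-03-01"]),
    "close_time": pd.to_datetime(["2017-03-01", "2017-02-01", "2017-02-05",
                                  "2017-04-01", "2017-09-11", "2018-02-01",
                                  "2018-04-01"]),
    "value": [5, 6, 7, 3, 1, 15, 3],
})

def closed_value_before_open(frame):
    """Hypothetical helper: for each row, sum `value` over the same user's
    rows whose close_time is strictly earlier than this row's open_time."""
    totals = []
    for _, row in frame.iterrows():
        earlier = frame[(frame["user"] == row["user"])
                        & (frame["close_time"] < row["open_time"])]
        totals.append(earlier["value"].sum())
    return pd.Series(totals, index=frame.index)

df["cumulated_closed_value"] = closed_value_before_open(df)
print(df)
```

This is quadratic in the number of rows, so it is only meant as a baseline for checking faster approaches on small inputs.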

WeN*_*Ben 9

I am using a new column here to record the condition df2['close_time']<df2['open_time']

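import numpy as np

# True when the next row opens after the current row has closed
df['New']=((df.open_time-df.close_time.shift()).dt.days>0).shift(-1)
# per-user running sum of the flagged values, shifted down one row
s=df.groupby('user').apply(lambda x : (x['value']*x['New']).cumsum().shift()).reset_index(level=0,drop=True)
# keep the running sum only where the previous row was flagged
s.loc[~(df.New.shift()==True)]=np.nan

df['Cumsum']=s

df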

Out[1043]: 
   user  open_time close_time  value    New Cumsum
0     1 2017-01-01 2017-03-01      5  False    NaN
1     1 2017-01-02 2017-02-01      6   True    NaN
2     1 2017-02-03 2017-02-05      7   True      6
3     1 2017-02-07 2017-04-01      3  False     13
4     2 2017-01-01 2017-02-01     15   True    NaN
5     2 2017-03-01 2017-04-01      3    NaN     15

Update: since the OP updated the question (using the data from Gabriel A):

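# per row: all close_time values of that row's user
df['New']=df.user.map(df.groupby('user').close_time.apply(lambda x: np.array(x)))
# per row: all values of that row's user
df['New1']=df.user.map(df.groupby('user').value.apply(lambda x: np.array(x)))
# per row: flags marking which of the user's rows closed before this row opened
df['New2']=[[x>m for m in y] for x,y in zip(df['open_time'],df['New'])  ]
# sum the values of the flagged rows
df['Yourtarget']=list(map(sum,df['New2']*df['New1'].values))
df.drop(['New','New1','New2'],axis=1)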


Out[1376]: 
   user  open_time close_time  value  Yourtarget
0     1 2016-12-30 2016-12-31      1           0
1     1 2017-01-01 2017-03-01      5           1
2     1 2017-01-02 2017-02-01      6           1
3     1 2017-02-03 2017-02-05      7           7
4     1 2017-02-07 2017-04-01      3          14
5     1 2017-09-07 2017-09-11      1          22
6     2 2018-01-01 2018-02-01     15           0
7     2 2018-03-01 2018-04-01      3          15
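
The same comparison of every open_time against every close_time of the same user can also be written with NumPy broadcasting instead of list columns. This is a sketch of that variant, not the answer's own code; it rebuilds the test data from the output above and, like the list-based version, needs memory quadratic in the largest user's row count.

```python
import pandas as pd

# Test data copied from the output above.
df = pd.DataFrame({
    "user": [1, 1, 1, 1, 1, 1, 2, 2],
    "open_time": pd.to_datetime(["2016-12-30", "2017-01-01", "2017-01-02",
                                 "2017-02-03", "2017-02-07", "2017-09-07",
                                 "2018-01-01", "2018-03-01"]),
    "close_time": pd.to_datetime(["2016-12-31", "2017-03-01", "2017-02-01",
                                  "2017-02-05", "2017-04-01", "2017-09-11",
                                  "2018-02-01", "2018-04-01"]),
    "value": [1, 5, 6, 7, 3, 1, 15, 3],
})

target = pd.Series(0, index=df.index, dtype="int64")
for _, group in df.groupby("user"):
    opened = group["open_time"].to_numpy()
    closed = group["close_time"].to_numpy()
    # closed_before[i, j] is True when row j closed strictly before row i opened
    closed_before = closed[None, :] < opened[:, None]
    # each row of the boolean matrix selects the values to add up
    target.loc[group.index] = closed_before @ group["value"].to_numpy()

df["Yourtarget"] = target
print(df)
```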

  • This is very concise ;-) (2 upvotes)

Gab*_*l A 6

I made a modification to your test case that I think you should include. This solution handles your edit.

import pandas as pd
import numpy as np
df = pd.read_csv("cond_shift.csv")
df

Input:

   user open_time   close_time  value
0   1   12/30/2016  12/31/2016  1
1   1   1/1/2017    3/1/2017    5
2   1   1/2/2017    2/1/2017    6
3   1   2/3/2017    2/5/2017    7
4   1   2/7/2017    4/1/2017    3
5   1   9/7/2017    9/11/2017   1
6   2   1/1/2018    2/1/2018    15
7   2   3/1/2018    4/1/2018    3

Create the column to be shifted:

df["open_time"] = pd.to_datetime(df["open_time"])
df["close_time"] = pd.to_datetime(df["close_time"])
df.sort_values(['user','close_time'],inplace=True)
df['close_cumsum']=df.groupby('user')['value'].cumsum()
df.sort_values(['user','open_time'],inplace=True)
df


   user open_time   close_time  value   close_cumsum
0   1   2016-12-30  2016-12-31  1       1
1   1   2017-01-01  2017-03-01  5       19
2   1   2017-01-02  2017-02-01  6       7
3   1   2017-02-03  2017-02-05  7       14
4   1   2017-02-07  2017-04-01  3       22
5   1   2017-09-07  2017-09-11  1       23
6   2   2018-01-01  2018-02-01  15      15
7   2   2018-03-01  2018-04-01  3       18

Shift the column (explanation below):

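# shift the running total down one row within each user
df["cumulated_closed_value"] = df.groupby("user")["close_cumsum"].transform("shift")
# rows with no previous event, or whose previous event had not closed before they opened
condition = ~(df.groupby("user")['close_time'].transform("shift") < df["open_time"])
df.loc[ condition,"cumulated_closed_value" ] = None
# forward-fill within each user, then use 0 where nothing had closed yet
df["cumulated_closed_value"] = df.groupby("user")["cumulated_closed_value"].ffill().fillna(0)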
df


   user open_time   close_time  value   close_cumsum    cumulated_closed_value
0   1   2016-12-30  2016-12-31  1       1               0.0
1   1   2017-01-01  2017-03-01  5       19              1.0
2   1   2017-01-02  2017-02-01  6       7               1.0
3   1   2017-02-03  2017-02-05  7       14              7.0
4   1   2017-02-07  2017-04-01  3       22              14.0
5   1   2017-09-07  2017-09-11  1       23              22.0
6   2   2018-01-01  2018-02-01  15      15              0.0
7   2   2018-03-01  2018-04-01  3       18              15.0

All of this is written so that it works across all users at once. I believe the logic is easier to follow if you focus on one user at a time.

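  • Assume no events happen concurrently. Then this is the same as shifting the cumulative sum down one row.
  • Remove the events that are concurrent with other events.
  • Fill in the missing values with a forward fill.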

I would still test this thoroughly before you use it. Time intervals are weird and there are a lot of edge cases.
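
One way to cross-check results while testing is an independent per-user computation. The sketch below uses a different technique from the shift-and-fill steps above: sort one user's rows by close_time, take a running total, and use np.searchsorted to count how many events had closed strictly before each open_time. The helper name cumulated_closed_value_for_user is an assumption made for illustration, and the data is the modified test case from this answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "user": [1, 1, 1, 1, 1, 1, 2, 2],
    "open_time": pd.to_datetime(["2016-12-30", "2017-01-01", "2017-01-02",
                                 "2017-02-03", "2017-02-07", "2017-09-07",
                                 "2018-01-01", "2018-03-01"]),
    "close_time": pd.to_datetime(["2016-12-31", "2017-03-01", "2017-02-01",
                                  "2017-02-05", "2017-04-01", "2017-09-11",
                                  "2018-02-01", "2018-04-01"]),
    "value": [1, 5, 6, 7, 3, 1, 15, 3],
})

def cumulated_closed_value_for_user(group):
    """Hypothetical helper for the rows of a single user."""
    by_close = group.sort_values("close_time")
    close_times = by_close["close_time"].to_numpy()
    # running_total[k] is the summed value of the k earliest-closing events
    running_total = np.concatenate(([0], by_close["value"].to_numpy().cumsum()))
    # side="left" counts the close times strictly before each open_time
    n_closed = np.searchsorted(close_times, group["open_time"].to_numpy(), side="left")
    return pd.Series(running_total[n_closed], index=group.index)

parts = [cumulated_closed_value_for_user(g) for _, g in df.groupby("user")]
df["cumulated_closed_value"] = pd.concat(parts)
print(df)
```

Because it does not rely on neighbouring rows, this version makes no assumption about how the intervals overlap, which makes it a useful reference for the edge cases mentioned above.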


Joh*_*hnE 6

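(Note: @wen's answer seems fine to me, so I'm not sure whether the OP is looking for something more or something different. In any event, here is an alternative approach using merge_asof that should also work well.)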

First reshape the dataframes as follows:

lookup = ( df[['close_time','value','user']].set_index(['user','close_time'])
           .sort_index().groupby('user').cumsum().reset_index(0) )

df = df.set_index('open_time').sort_index()

The idea of "lookup" is simply to sort by "close_time" and then take a (grouped) cumulative sum:

            user  value
close_time             
2017-02-01     1      6
2017-02-05     1     13
2017-03-01     1     18
2017-04-01     1     21
2017-09-11     1     22
2018-02-01     2     15
2018-04-01     2     18

For "df", we just take a subset of the original dataframe:

            user close_time  value
open_time                         
2017-01-01     1 2017-03-01      5
2017-01-02     1 2017-02-01      6
2017-02-03     1 2017-02-05      7
2017-02-07     1 2017-04-01      3
2017-09-07     1 2017-09-11      1
2018-01-01     2 2018-02-01     15
2018-03-01     2 2018-04-01      3

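From here, you conceptually just want to merge the two datasets on "user" and "open_time"/"close_time", but the complicating factor is that we don't want an exact match on the time, but rather a sort of "nearest" match.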

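For these sorts of merges you can use merge_asof, which is a great tool for a variety of inexact matches (including 'nearest', 'backward', and 'forward'). Because the match has to stay within each user, this answer loops over the users, but it is still pretty simple code: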

# DataFrame.append was removed in pandas 2.0, so collect the per-user merges with pd.concat
df_merged = pd.concat(
    [ pd.merge_asof( df[df.user==u],  lookup[lookup.user==u],
                     left_index=True, right_index=True,
                     direction='backward' )
      for u in df['user'].unique() ] )

df_merged.drop('user_y',axis=1).rename({'value_y':'close_cumsum'},axis=1)

Result:

            user_x close_time  value_x  close_cumsum
open_time                                           
2017-01-01       1 2017-03-01        5           NaN
2017-01-02       1 2017-02-01        6           NaN
2017-02-03       1 2017-02-05        7           6.0
2017-02-07       1 2017-04-01        3          13.0
2017-09-07       1 2017-09-11        1          21.0
2018-01-01       2 2018-02-01       15           NaN
2018-03-01       2 2018-04-01        3          15.0
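
merge_asof also accepts a by argument that restricts matches to rows with the same key, so the per-user loop can probably be dropped. The following is a minimal sketch under a few assumptions: it rebuilds the question's sample data, renames the lookup's time column to closed_at (a name made up here to avoid a column clash), and passes allow_exact_matches=False so that only close times strictly before the open time match, which is stricter than the default used above.

```python
import pandas as pd

df = pd.DataFrame({
    "user": [1, 1, 1, 1, 1, 2, 2],
    "open_time": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-02-03",
                                 "2017-02-07", "2017-09-07", "2018-01-01",
                                 "2018-03-01"]),
    "close_time": pd.to_datetime(["2017-03-01", "2017-02-01", "2017-02-05",
                                  "2017-04-01", "2017-09-11", "2018-02-01",
                                  "2018-04-01"]),
    "value": [5, 6, 7, 3, 1, 15, 3],
})

# Running total of value per user, in order of close_time.
lookup = df.sort_values(["user", "close_time"]).copy()
lookup["close_cumsum"] = lookup.groupby("user")["value"].cumsum()
lookup = (lookup.rename(columns={"close_time": "closed_at"})
                [["user", "closed_at", "close_cumsum"]]
                .sort_values("closed_at"))  # merge_asof needs the key sorted

merged = pd.merge_asof(
    df.sort_values("open_time"),   # left side must be sorted on its key too
    lookup,
    left_on="open_time",
    right_on="closed_at",
    by="user",                     # only match rows of the same user
    direction="backward",
    allow_exact_matches=False,     # require closed_at < open_time
)
merged = merged.sort_values(["user", "open_time"]).reset_index(drop=True)
print(merged[["user", "open_time", "close_time", "value", "close_cumsum"]])
```

Rows with no earlier close for the same user come back as NaN in close_cumsum, matching the result shown above.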