Chi*_*ion 5 python group-by date-difference pandas rolling-sum
I have a DataFrame:
ID DATE WIN
A 2015/6/5 Yes
A 2015/6/7 Yes
A 2015/6/7 Yes
A 2015/6/7 Yes
B 2015/6/8 No
B 2015/8/7 Yes
C 2015/5/15 Yes
C 2015/5/30 No
C 2015/7/30 No
C 2015/8/03 Yes
I want to add a column that counts, for each ID, the number of wins in the past 1 month, with a result like this:
ID DATE WIN NumOfDaysSinceLastWin NumOfWinsInThePast30days
A 2015/6/5 Yes 0 0
A 2015/6/7 Yes 2 1
A 2015/6/7 Yes 2 1 or (A 2015/6/7 Yes 0 2)
A 2015/6/8 No 1 3
B 2015/8/7 No 0 0
B 2015/8/7 Yes 0 0
C 2015/5/15 Yes 0 0
C 2015/5/30 No 15 1
C 2015/7/30 No 76 0
C 2015/8/03 Yes 80 0
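How can I get this using the groupby function and TimeGrouper?
The input data must be sorted by date within each group; in this data it already is.
The input data does not cover all the cases well, so the next 4 rows were added.
Column WIN1 is created from the values of WIN: 1 for 'Yes' and 0 for 'No'. Both of my output columns need it.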
df['WIN1'] = df['WIN'].map(lambda x: 1 if x == 'Yes' else 0)
NumOfDaysSinceLastWin
First, the column cumsum (cumulative sum) is created.
df['cumsum'] = df['WIN1'].cumsum()
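If every value of WIN were 'Yes', it would be easy. The data would be grouped, and the difference between each date and the previous date (shifted by one row) would go into column diffs.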
#df['diffs'] = df.groupby(['ID', 'cumsum'])['DATE'].apply(lambda d: (d-d.shift()).fillna(0))
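As a rough illustration of that easy case, the sketch below assumes a hypothetical frame in which every row is a win, so the difference to the previous row of the same ID is already the number of days since the last win. The frame wins_only and its values are made up for this example, and it groups by ID only.

```python
import pandas as pd

# Hypothetical data in which every row is a win
wins_only = pd.DataFrame({
    'ID': ['A', 'A', 'B', 'B'],
    'DATE': pd.to_datetime(['2015-06-05', '2015-06-07',
                            '2015-06-08', '2015-08-07']),
})

# previous date within the same ID; the first row of each ID has none (NaT)
previous = wins_only.groupby('ID')['DATE'].shift()

# difference in whole days, with 0 where there is no previous row
wins_only['diffs'] = (wins_only['DATE'] - previous).dt.days.fillna(0).astype(int)
print(wins_only)
```

With 'No' rows mixed in, the previous row is no longer guaranteed to be a win, which is why the steps that follow are needed.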
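But the situation is complicated by the 'No' values in column WIN. If the value is 'Yes', you need the difference from the previous 'Yes'; if it is 'No', you need the difference from the last preceding 'Yes'. The difference can be computed in several ways; here it is obtained by subtracting two columns, DATE and date1.
Column date1
Rows must be grouped in a special way: each 'No' row together with the last preceding 'Yes' row. This is possible with the cumulative sum in column cumsum. The minimum date of each such group is the date of its 'Yes' row, and this value is repeated on the rows with 'No' values. Column count is special: values of cumsum that are not repeated get 1, and repeated values get the size of their group.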
df['min'] = df.groupby(['ID','cumsum'])['DATE'].transform('min')
df['count'] = df.groupby(['cumsum'])['cumsum'].transform('count')
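To see why the cumulative sum produces these groups, here is a minimal sketch on the four C rows from the question. The frame name toy is made up for this example, and WIN1 is built with a comparison instead of map, which should give the same 0/1 values.

```python
import pandas as pd

# The four C rows from the question
toy = pd.DataFrame({
    'ID': ['C', 'C', 'C', 'C'],
    'DATE': pd.to_datetime(['2015-05-15', '2015-05-30',
                            '2015-07-30', '2015-08-03']),
    'WIN': ['Yes', 'No', 'No', 'Yes'],
})

# 1 for 'Yes', 0 for 'No'
toy['WIN1'] = (toy['WIN'] == 'Yes').astype(int)

# the running total only increases on a 'Yes' row, so every 'No' row
# shares its value with the last preceding 'Yes' row
toy['cumsum'] = toy['WIN1'].cumsum()

# the earliest date of each block is the date of that 'Yes' row
toy['min'] = toy.groupby(['ID', 'cumsum'])['DATE'].transform('min')
print(toy[['DATE', 'WIN', 'cumsum', 'min']])
```

The two 'No' rows end up in the same block as the win on 2015-05-15, so their min is that date.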
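The date of the previous 'Yes' row is needed for the difference. DataFrame df1 keeps only the 'Yes' rows of df and is then grouped by column ID. The index is unchanged, so the output can be mapped to a new column of DataFrame df.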
df1 = df[~df['WIN'].isin(['No'])]
# shift the 'Yes' dates down by one row within each ID (the original index is kept)
df['date1'] = df1.groupby(['ID'])['DATE'].shift()
print(df)
ID DATE WIN WIN1 cumsum min count date1
0 A 2015-06-05 Yes 1 1 2015-06-05 1 NaT
1 A 2015-06-05 Yes 1 2 2015-06-05 1 2015-06-05
2 A 2015-06-07 Yes 1 3 2015-06-07 1 2015-06-05
3 A 2015-06-07 Yes 1 4 2015-06-07 1 2015-06-07
4 A 2015-06-07 Yes 1 5 2015-06-07 4 2015-06-07
5 A 2015-06-08 No 0 5 2015-06-07 4 NaT
6 B 2015-06-07 No 0 5 2015-06-07 4 NaT
7 B 2015-06-07 No 0 5 2015-06-07 4 NaT
8 B 2015-08-07 Yes 1 6 2015-08-07 1 NaT
9 C 2015-05-15 Yes 1 7 2015-05-15 3 NaT
10 C 2015-05-30 No 0 7 2015-05-15 3 NaT
11 C 2015-07-30 No 0 7 2015-05-15 3 NaT
12 C 2015-08-03 Yes 1 8 2015-08-03 1 2015-05-15
13 C 2015-08-03 Yes 1 9 2015-08-03 1 2015-08-03
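Then the dates from column min (for 'No' rows and the last preceding 'Yes') and from column date1 (the other 'Yes' rows) can be combined using column count.
A new condition was added: the value in column date1 must be null (NaT), because only these values are overwritten by column min.
df.loc[(df['count'] >= 1) & (df['date1'].isnull()), 'date1'] = df['min']
print(df)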
ID DATE WIN WIN1 cumsum min count date1
0 A 2015-06-05 Yes 1 1 2015-06-05 1 2015-06-05
1 A 2015-06-05 Yes 1 2 2015-06-05 1 2015-06-05
2 A 2015-06-07 Yes 1 3 2015-06-07 1 2015-06-05
3 A 2015-06-07 Yes 1 4 2015-06-07 1 2015-06-07
4 A 2015-06-07 Yes 1 5 2015-06-07 4 2015-06-07
5 A 2015-06-08 No 0 5 2015-06-07 4 2015-06-07
6 B 2015-06-07 No 0 5 2015-06-07 4 2015-06-07
7 B 2015-06-07 No 0 5 2015-06-07 4 2015-06-07
8 B 2015-08-07 Yes 1 6 2015-08-07 1 2015-08-07
9 C 2015-05-15 Yes 1 7 2015-05-15 3 2015-05-15
10 C 2015-05-30 No 0 7 2015-05-15 3 2015-05-15
11 C 2015-07-30 No 0 7 2015-05-15 3 2015-05-15
12 C 2015-08-03 Yes 1 8 2015-08-03 1 2015-05-15
13 C 2015-08-03 Yes 1 9 2015-08-03 1 2015-08-03
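Repeating datetimes - sub-solution
Sorry if this is a complicated way; maybe someone will find a better one.
My solution is to find the repeated values, fill them with the date of the last previous 'Yes' and write them to column date1 for the difference.
These values are identified in column count. The other values (1) are reset to NaN. Then the values of date1 are copied to date2 wherever count is not null.
df['count'] = df1.groupby(['ID', 'DATE', 'WIN1'])['WIN1'].transform('count')
df.loc[df['count'] == 1 , 'count'] = np.nan
# copy date1 only on rows with a repeated date; the other rows get NaT
df['date2'] = df['date1'].where(df['count'].notnull())
print(df)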
ID DATE WIN WIN1 cumsum min count date1 date2
0 A 2015-06-05 Yes 1 1 2015-06-05 2 2015-06-05 2015-06-05
1 A 2015-06-05 Yes 1 2 2015-06-05 2 2015-06-05 2015-06-05
2 A 2015-06-07 Yes 1 3 2015-06-07 3 2015-06-05 2015-06-05
3 A 2015-06-07 Yes 1 4 2015-06-07 3 2015-06-07 2015-06-07
4 A 2015-06-07 Yes 1 5 2015-06-07 3 2015-06-07 2015-06-07
5 A 2015-06-08 No 0 5 2015-06-07 NaN 2015-06-07 NaT
6 B 2015-06-07 No 0 5 2015-06-07 NaN 2015-06-07 NaT
7 B 2015-06-07 No 0 5 2015-06-07 NaN 2015-06-07 NaT
8 B 2015-08-07 Yes 1 6 2015-08-07 NaN 2015-08-07 NaT
9 C 2015-05-15 Yes 1 7 2015-05-15 NaN 2015-05-15 NaT
10 C 2015-05-30 No 0 7 2015-05-15 NaN 2015-05-15 NaT
11 C 2015-07-30 No 0 7 2015-05-15 NaN 2015-05-15 NaT
12 C 2015-08-03 Yes 1 8 2015-08-03 2 2015-05-15 2015-05-15
13 C 2015-08-03 Yes 1 9 2015-08-03 2 2015-08-03 2015-08-03
Then the value is repeated within each group as the group minimum and written to column date1.
# repeat the earliest date2 of each (ID, DATE) group on all rows of that group
df['date2'] = df.groupby(['ID', 'DATE'])['date2'].transform('min')
df.loc[df['date2'].notnull() , 'date1'] = df['date2']
print(df)
ID DATE WIN WIN1 cumsum min count date1 date2
0 A 2015-06-05 Yes 1 1 2015-06-05 2 2015-06-05 2015-06-05
1 A 2015-06-05 Yes 1 2 2015-06-05 2 2015-06-05 2015-06-05
2 A 2015-06-07 Yes 1 3 2015-06-07 3 2015-06-05 2015-06-05
3 A 2015-06-07 Yes 1 4 2015-06-07 3 2015-06-05 2015-06-05
4 A 2015-06-07 Yes 1 5 2015-06-07 3 2015-06-05 2015-06-05
5 A 2015-06-08 No 0 5 2015-06-07 NaN 2015-06-07 NaT
6 B 2015-06-07 No 0 5 2015-06-07 NaN 2015-06-07 NaT
7 B 2015-06-07 No 0 5 2015-06-07 NaN 2015-06-07 NaT
8 B 2015-08-07 Yes 1 6 2015-08-07 NaN 2015-08-07 NaT
9 C 2015-05-15 Yes 1 7 2015-05-15 NaN 2015-05-15 NaT
10 C 2015-05-30 No 0 7 2015-05-15 NaN 2015-05-15 NaT
11 C 2015-07-30 No 0 7 2015-05-15 NaN 2015-05-15 NaT
12 C 2015-08-03 Yes 1 8 2015-08-03 2 2015-05-15 2015-05-15
13 C 2015-08-03 Yes 1 9 2015-08-03 2 2015-05-15 2015-05-15
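Column NumOfDaysSinceLastWin is filled with the difference between columns DATE and date1. Its data type is Timedelta, so it is converted to integer days. Finally, the unneeded columns are dropped. (Only columns WIN1 and count are needed for the next output column, so they are not dropped.)
# whole days between DATE and date1; 0 where there is no earlier date
df['NumOfDaysSinceLastWin'] = (df['DATE'] - df['date1']).dt.days.fillna(0).astype(int)
df = df.drop(['cumsum','min', 'date1', 'date2'], axis=1 )
print(df)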
ID DATE WIN WIN1 count NumOfDaysSinceLastWin
0 A 2015-06-05 Yes 1 2 0
1 A 2015-06-05 Yes 1 2 0
2 A 2015-06-07 Yes 1 3 2
3 A 2015-06-07 Yes 1 3 2
4 A 2015-06-07 Yes 1 3 2
5 A 2015-06-08 No 0 NaN 1
6 B 2015-06-07 No 0 NaN 0
7 B 2015-06-07 No 0 NaN 0
8 B 2015-08-07 Yes 1 NaN 0
9 C 2015-05-15 Yes 1 NaN 0
10 C 2015-05-30 No 0 NaN 15
11 C 2015-07-30 No 0 NaN 76
12 C 2015-08-03 Yes 1 2 80
13 C 2015-08-03 Yes 1 2 80
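The column can be cross-checked with a slow but direct computation. The sketch below assumes the rule that seems to follow from the table above: for each row, take the latest 'Yes' of the same ID on a strictly earlier date, and use 0 when there is none. The helper days_since_last_win and the frame data are hypothetical names and are not part of the solution.

```python
import pandas as pd

def days_since_last_win(df):
    """Hypothetical helper: days between each row and the latest 'Yes' row
    of the same ID on a strictly earlier date; 0 when there is none."""
    out = []
    for _, row in df.iterrows():
        earlier = df[(df['ID'] == row['ID'])
                     & (df['WIN'] == 'Yes')
                     & (df['DATE'] < row['DATE'])]
        if len(earlier):
            out.append((row['DATE'] - earlier['DATE'].max()).days)
        else:
            out.append(0)
    return pd.Series(out, index=df.index)

# the extended input data (with the repeated rows)
data = pd.DataFrame({
    'ID': list('AAAAAABBBCCCCC'),
    'DATE': pd.to_datetime([
        '2015-06-05', '2015-06-05', '2015-06-07', '2015-06-07', '2015-06-07',
        '2015-06-08', '2015-06-07', '2015-06-07', '2015-08-07', '2015-05-15',
        '2015-05-30', '2015-07-30', '2015-08-03', '2015-08-03']),
    'WIN': ['Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes',
            'Yes', 'No', 'No', 'Yes', 'Yes'],
})
data['NumOfDaysSinceLastWin'] = days_since_last_win(data)
print(data)
```

The loop compares every row with every other row of its ID, so it is only meant for checking small inputs.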
NumOfWinsInThePast30days
A rolling sum is your friend. Column yes (needed for resampling) is mapped to 1 for 'Yes' and NaN for 'No'.
df['yes'] = df['WIN'].map(lambda x: 1 if x == 'Yes' else np.nan)
DataFrame df2 is a copy of df with column DATE set as the index (needed for resampling). The unneeded columns are dropped.
df2 = df.set_index('DATE')
df2 = df2.drop(['NumOfDaysSinceLastWin','WIN', 'WIN1', 'count'], axis=1)
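Then df2 is resampled by day within each ID, counting the rows per day: a 'Yes' row counts as 1 and a 'No' row as 0. (It is best to look at the output below.)
# daily count of 'Yes' rows per ID; NaN is not counted, and the
# how='count' keyword was removed from later pandas versions
df2 = df2.groupby('ID').resample('D')['yes'].count()
df2 = df2.reset_index()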
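DataFrame df2 is grouped by ID and a rolling sum is applied within these groups.
# 30-row rolling sum per ID; pd.rolling_sum was removed from later pandas versions
df2['rollsum'] = df2.groupby('ID')['yes'].transform(lambda s: s.rolling(window=30, min_periods=1).sum())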
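The reason for resampling first is that the rolling window counts rows, not days. A small sketch with made-up data (the series wins and its two dates are assumptions for illustration) shows that once there is one row per calendar day, a 30-row window covers the current day and the 29 days before it:

```python
import pandas as pd

# Hypothetical wins: one on 2015-01-01 and one 39 days later
wins = pd.Series([1, 1], index=pd.to_datetime(['2015-01-01', '2015-02-09']))

# one row per calendar day; days without a win get a count of 0
daily = wins.resample('D').count()

# with daily rows, 30 rows span 30 calendar days
rollsum = daily.rolling(window=30, min_periods=1).sum()

print(rollsum.loc[pd.Timestamp('2015-01-30')])  # first win still inside the window
print(rollsum.loc[pd.Timestamp('2015-01-31')])  # first win has dropped out
```

The same effect is visible in the C rows of the full output: the win on 2015-05-15 is counted up to 2015-06-13 and drops out on 2015-06-14.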
For better understanding, all rows of df2 are displayed.
with pd.option_context('display.max_rows', 999, 'display.max_columns', 5):
    print(df2)
ID DATE yes rollsum
0 A 2015-06-05 2 2
1 A 2015-06-06 0 2
2 A 2015-06-07 3 5
3 A 2015-06-08 0 5
4 B 2015-06-07 0 0
5 B 2015-06-08 0 0
6 B 2015-06-09 0 0
7 B 2015-06-10 0 0
8 B 2015-06-11 0 0
9 B 2015-06-12 0 0
10 B 2015-06-13 0 0
11 B 2015-06-14 0 0
12 B 2015-06-15 0 0
13 B 2015-06-16 0 0
14 B 2015-06-17 0 0
15 B 2015-06-18 0 0
16 B 2015-06-19 0 0
17 B 2015-06-20 0 0
18 B 2015-06-21 0 0
19 B 2015-06-22 0 0
20 B 2015-06-23 0 0
21 B 2015-06-24 0 0
22 B 2015-06-25 0 0
23 B 2015-06-26 0 0
24 B 2015-06-27 0 0
25 B 2015-06-28 0 0
26 B 2015-06-29 0 0
27 B 2015-06-30 0 0
28 B 2015-07-01 0 0
29 B 2015-07-02 0 0
30 B 2015-07-03 0 0
31 B 2015-07-04 0 0
32 B 2015-07-05 0 0
33 B 2015-07-06 0 0
34 B 2015-07-07 0 0
35 B 2015-07-08 0 0
36 B 2015-07-09 0 0
37 B 2015-07-10 0 0
38 B 2015-07-11 0 0
39 B 2015-07-12 0 0
40 B 2015-07-13 0 0
41 B 2015-07-14 0 0
42 B 2015-07-15 0 0
43 B 2015-07-16 0 0
44 B 2015-07-17 0 0
45 B 2015-07-18 0 0
46 B 2015-07-19 0 0
47 B 2015-07-20 0 0
48 B 2015-07-21 0 0
49 B 2015-07-22 0 0
50 B 2015-07-23 0 0
51 B 2015-07-24 0 0
52 B 2015-07-25 0 0
53 B 2015-07-26 0 0
54 B 2015-07-27 0 0
55 B 2015-07-28 0 0
56 B 2015-07-29 0 0
57 B 2015-07-30 0 0
58 B 2015-07-31 0 0
59 B 2015-08-01 0 0
60 B 2015-08-02 0 0
61 B 2015-08-03 0 0
62 B 2015-08-04 0 0
63 B 2015-08-05 0 0
64 B 2015-08-06 0 0
65 B 2015-08-07 1 1
66 C 2015-05-15 1 1
67 C 2015-05-16 0 1
68 C 2015-05-17 0 1
69 C 2015-05-18 0 1
70 C 2015-05-19 0 1
71 C 2015-05-20 0 1
72 C 2015-05-21 0 1
73 C 2015-05-22 0 1
74 C 2015-05-23 0 1
75 C 2015-05-24 0 1
76 C 2015-05-25 0 1
77 C 2015-05-26 0 1
78 C 2015-05-27 0 1
79 C 2015-05-28 0 1
80 C 2015-05-29 0 1
81 C 2015-05-30 0 1
82 C 2015-05-31 0 1
83 C 2015-06-01 0 1
84 C 2015-06-02 0 1
85 C 2015-06-03 0 1
86 C 2015-06-04 0 1
87 C 2015-06-05 0 1
88 C 2015-06-06 0 1
89 C 2015-06-07 0 1
90 C 2015-06-08 0 1
91 C 2015-06-09 0 1
92 C 2015-06-10 0 1
93 C 2015-06-11 0 1
94 C 2015-06-12 0 1
95 C 2015-06-13 0 1
96 C 2015-06-14 0 0
97 C 2015-06-15 0 0
98 C 2015-06-16 0 0
99 C 2015-06-17 0 0
100 C 2015-06-18 0 0
101 C 2015-06-19 0 0
102 C 2015-06-20 0 0
103 C 2015-06-21 0 0
104 C 2015-06-22 0 0
105 C 2015-06-23 0 0
106 C 2015-06-24 0 0
107 C 2015-06-25 0 0
108 C 2015-06-26 0 0
109 C 2015-06-27 0 0
110 C 2015-06-28 0 0
111 C 2015-06-29 0 0
112 C 2015-06-30 0 0
113 C 2015-07-01 0 0
114 C 2015-07-02 0 0
115 C 2015-07-03 0 0
116 C 2015-07-04 0 0
117 C 2015-07-05 0 0
118 C 2015-07-06 0 0
119 C 2015-07-07 0 0
120 C 2015-07-08 0 0
121 C 2015-07-09 0 0
122 C 2015-07-10 0 0
123 C 2015-07-11 0 0
124 C 2015-07-12 0 0
125 C 2015-07-13 0 0
126 C 2015-07-14 0 0
127 C 2015-07-15 0 0
128 C 2015-07-16 0 0
129 C 2015-07-17 0 0
130 C 2015-07-18 0 0
131 C 2015-07-19 0 0
132 C 2015-07-20 0 0
133 C 2015-07-21 0 0
134 C 2015-07-22 0 0
135 C 2015-07-23 0 0
136 C 2015-07-24 0 0
137 C 2015-07-25 0 0
138 C 2015-07-26 0 0
139 C 2015-07-27 0 0
140 C 2015-07-28 0 0
141 C 2015-07-29 0 0
142 C 2015-07-30 0 0
143 C 2015-07-31 0 0
144 C 2015-08-01 0 0
145 C 2015-08-02 0 0
146 C 2015-08-03 2 2
The unneeded column yes is dropped.
df2 = df2.drop(['yes'], axis=1 )
The output is merged with the first DataFrame df.
df2 = pd.merge(df,df2,on=['DATE', 'ID'], how='inner')
print(df2)
ID DATE WIN WIN1 NumOfDaysSinceLastWin yes rollsum
0 A 2015-06-07 Yes 1 0 1 2
1 A 2015-06-07 Yes 1 0 1 2
2 B 2015-08-07 No 0 0 NaN 1
3 B 2015-08-07 Yes 1 0 1 1
4 C 2015-05-15 Yes 1 0 1 1
5 C 2015-05-30 No 0 15 NaN 1
6 C 2015-07-30 No 0 76 NaN 0
7 C 2015-08-03 Yes 1 80 1 1
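If the values in column count are not null, they are written to column WIN1. The rolling sum also counts the 'Yes' rows of the current date in the original df, so they must be subtracted. For dates that are not repeated, this value (1) is already in column WIN1.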
df2.loc[df['count'].notnull() , 'WIN1'] = df2['count']
df2['NumOfWinsInThePast30days'] = df2['rollsum'] - df2['WIN1']
The unneeded columns are dropped.
df2 = df2.drop(['yes','WIN1', 'rollsum', 'count'], axis=1 )
print(df2)
ID DATE WIN NumOfDaysSinceLastWin NumOfWinsInThePast30days
0 A 2015-06-05 Yes 0 0
1 A 2015-06-05 Yes 0 0
2 A 2015-06-07 Yes 2 2
3 A 2015-06-07 Yes 2 2
4 A 2015-06-07 Yes 2 2
5 A 2015-06-08 No 1 5
6 B 2015-06-07 No 0 0
7 B 2015-06-07 No 0 0
8 B 2015-08-07 Yes 0 0
9 C 2015-05-15 Yes 0 0
10 C 2015-05-30 No 15 1
11 C 2015-07-30 No 76 0
12 C 2015-08-03 Yes 80 0
13 C 2015-08-03 Yes 80 0
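As a sanity check on this table, the same numbers can be reproduced by brute force. The sketch below assumes the window that the steps above appear to implement: wins of the same ID dated between 1 and 29 days before the row, with same-day wins excluded by the subtraction. The helper wins_before and the frame data are hypothetical names used only for this check.

```python
import pandas as pd

def wins_before(df, days=29):
    """Hypothetical helper: for each row, count the 'Yes' rows of the same ID
    dated 1..days days earlier (wins on the same day are not counted)."""
    out = []
    for _, row in df.iterrows():
        same_id_wins = df[(df['ID'] == row['ID']) & (df['WIN'] == 'Yes')]
        age = (row['DATE'] - same_id_wins['DATE']).dt.days
        out.append(int(((age >= 1) & (age <= days)).sum()))
    return pd.Series(out, index=df.index)

# the extended input data (with the repeated rows)
data = pd.DataFrame({
    'ID': list('AAAAAABBBCCCCC'),
    'DATE': pd.to_datetime([
        '2015-06-05', '2015-06-05', '2015-06-07', '2015-06-07', '2015-06-07',
        '2015-06-08', '2015-06-07', '2015-06-07', '2015-08-07', '2015-05-15',
        '2015-05-30', '2015-07-30', '2015-08-03', '2015-08-03']),
    'WIN': ['Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes',
            'Yes', 'No', 'No', 'Yes', 'Yes'],
})
data['NumOfWinsInThePast30days'] = wins_before(data)
print(data)
```

If a different definition of "the past 30 days" is wanted, for example including day 30, only the days argument needs to change in this check; the resample and rolling steps would need a matching window.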
All together:
import pandas as pd
import numpy as np
import io
#original data
temp=u"""ID,DATE,WIN
A,2015/6/5,Yes
A,2015/6/7,Yes
A,2015/6/7,Yes
A,2015/6/8,No
B,2015/6/7,No
B,2015/8/7,Yes
C,2015/5/15,Yes
C,2015/5/30,No
C,2015/7/30,No
C,2015/8/03,Yes"""
#changed repeating data
temp2=u"""ID,DATE,WIN
A,2015/6/5,Yes
A,2015/6/5,Yes
A,2015/6/7,Yes
A,2015/6/7,Yes
A,2015/6/7,Yes
A,2015/6/8,No
B,2015/6/7,No
B,2015/6/7,No
B,2015/8/7,Yes
C,2015/5/15,Yes
C,2015/5/30,No
C,2015/7/30,No
C,2015/8/03,Yes
C,2015/8/03,Yes"""
df = pd.read_csv(io.StringIO(temp2), parse_dates = [1])
df['WIN1'] = df['WIN'].map(lambda x: 1 if x == 'Yes' else 0)
df['cumsum'] = df['WIN1'].cumsum()
#df['diffs'] = df.groupby(['ID', 'cumsum'])['DATE'].apply(lambda d: (d-d.shift()).fillna(0))
df['min'] = df.groupby(['ID','cumsum'])['DATE'].transform('min')
df['count'] = df.groupby(['cumsum'])['cumsum'].transform('count')
df1 = df[~df['WIN'].isin(['No'])]
# previous 'Yes' date within each ID (the original index is kept)
df['date1'] = df1.groupby(['ID'])['DATE'].shift()
print(df)
df.loc[(df['count'] >= 1) & (df['date1'].isnull()), 'date1'] = df['min']
print(df)
#resolve repeating datetimes
df['count'] = df1.groupby(['ID', 'DATE', 'WIN1'])['WIN1'].transform('count')
df.loc[df['count'] == 1 , 'count'] = np.nan
# copy date1 only on rows with a repeated date; the other rows get NaT
df['date2'] = df['date1'].where(df['count'].notnull())
print(df)
# repeat the earliest date2 of each (ID, DATE) group on all rows of that group
df['date2'] = df.groupby(['ID', 'DATE'])['date2'].transform('min')
df.loc[df['date2'].notnull() , 'date1'] = df['date2']
print(df)
df['NumOfDaysSinceLastWin'] = (df['DATE'] - df['date1']).dt.days
df = df.drop(['cumsum','min','date1', 'date2'], axis=1 )
print(df)
#NumOfWinsInThePast30days
df['yes'] = df['WIN'].map(lambda x: 1 if x == 'Yes' else np.nan)
df2 = df.set_index('DATE')
df2 = df2.drop(['NumOfDaysSinceLastWin','WIN', 'WIN1','count'], axis=1)
# daily count of 'Yes' rows per ID (the how='count' keyword was removed from later pandas versions)
df2 = df2.groupby('ID').resample('D')['yes'].count()
df2 = df2.reset_index()
# 30-row rolling sum per ID (pd.rolling_sum was removed from later pandas versions)
df2['rollsum'] = df2.groupby('ID')['yes'].transform(lambda s: s.rolling(window=30, min_periods=1).sum())
#with pd.option_context('display.max_rows', 999, 'display.max_columns', 5):
#    print(df2)
df2 = df2.drop(['yes'], axis=1 )
df2 = pd.merge(df,df2,on=['DATE', 'ID'], how='inner')
print(df2)
df2.loc[df['count'].notnull() , 'WIN1'] = df2['count']
df2['NumOfWinsInThePast30days'] = df2['rollsum'] - df2['WIN1']
df2 = df2.drop(['yes','WIN1', 'rollsum', 'count'], axis=1 )
print(df2)