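I'm trying to apply a function cumulatively to the values that fall within a window defined by the 'start' and 'finish' columns. That is, 'start' and 'finish' define the interval over which a value is 'active'; for each row, I want the sum of all values that are active at that time.
Here is a brute-force example that does what I'm after. Is there a more elegant, faster, or more memory-efficient way?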
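import pandas as pd

df = pd.DataFrame(data=[[1,3,100], [2,4,200], [3,6,300], [4,6,400], [5,6,500]],
                  columns=['start', 'finish', 'val'])
# Cross join: pair every row (_x) with every other row (_y) via a constant key
df['dummy'] = 1
df = df.merge(df, on=['dummy'], how='left')
# Keep the pairs where row _y is active at row _x's start time
df = df[(df['start_y'] <= df['start_x']) & (df['finish_y'] > df['start_x'])]
val = df.groupby('start_x')['val_y'].sum()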
Initially, df is:
   start  finish  val
0      1       3  100
1      2       4  200
2      3       6  300
3      4       6  400
4      5       6  500
The result I'm after is:
1 100
2 300
3 500
4 700
5 1200
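numba
import numpy as np
from numba import njit

@njit
def pir_numba(S, F, V):
    mn = S.min()
    mx = F.max()
    # out[t] accumulates the values that are active at integer time t
    out = np.zeros(mx)
    for s, f, v in zip(S, F, V):
        out[s:f] += v  # active on the half-open interval [s, f)
    return out[mn:]  # drop the unused slots before the earliest start

pir_numba(*[df[c].values for c in ['start', 'finish', 'val']])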
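For anyone who can't use numba, the same idea can probably be done with a difference array and a running sum in plain NumPy. This is only a sketch, not part of the answer above: the function name `active_sum_cumsum` is made up for illustration, and it assumes the times are integers and that 'finish' is exclusive, as in the question.

```python
import numpy as np

def active_sum_cumsum(start, finish, val):
    """Sum of the values active at each integer time in [start.min(), finish.max())."""
    lo = start.min()
    hi = finish.max()
    # delta[t] holds the net change in the active total at time lo + t.
    delta = np.zeros(hi - lo + 1, dtype=val.dtype)
    np.add.at(delta, start - lo, val)        # a value switches on at its start
    np.subtract.at(delta, finish - lo, val)  # and off at its (exclusive) finish
    # Running total of the changes; the last slot is always 0, so drop it.
    return np.cumsum(delta)[:-1]

# Same data as the question's df, as plain arrays.
start = np.array([1, 2, 3, 4, 5])
finish = np.array([3, 4, 6, 6, 6])
val = np.array([100, 200, 300, 400, 500])
result = active_sum_cumsum(start, finish, val)
print(result)
```

Like `pir_numba`, this returns one entry per integer time from the earliest start up to the latest finish, so its cost grows with the number of rows plus the length of that time range rather than with the number of row pairs.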
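np.bincount
s, f, v = [df[col].values for col in ['start', 'finish', 'val']]
# Expand each interval into its time points; `i - 1` maps times to 0-based bins
# (this assumes the earliest start is 1, as in the example data)
np.bincount([i - 1 for r in map(range, s, f) for i in r], v.repeat(f - s))
array([ 100.,  300.,  500.,  700., 1200.])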
This relies on the index being unique
pd.Series({
    (k, i): v
    for i, s, f, v in df.itertuples()
    for k in range(s, f)
}).groupby(level=0).sum()  # Series.sum(level=0) was removed in pandas 2.0
1     100
2     300
3     500
4     700
5    1200
dtype: int64
Without relying on the index
pd.Series({
    (k, i): v
    for i, (s, f, v) in enumerate(zip(*map(df.get, ['start', 'finish', 'val'])))
    for k in range(s, f)
}).groupby(level=0).sum()  # Series.sum(level=0) was removed in pandas 2.0
Using numpy broadcasting. Unfortunately this is still an O(n*m) solution, but it should be faster than the groupby. So far, in my tests, Pir's solution performs best.
s1=df['start'].values
s2=df['finish'].values
# Row i, column j: is row j active at row i's start? (start_j <= start_i < finish_j)
np.sum(((s1<=s1[:,None])&(s2>s1[:,None]))*df.val.values,1)
Out[44]: array([ 100,  300,  500,  700, 1200], dtype=int64)
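Since the question also asks about memory: the full pairwise mask needs n*n booleans, which gets large quickly. One way around that, sketched below, is to build the mask a block of rows at a time. This is my own illustration rather than something benchmarked in this thread; `active_sum_chunked` and the `chunk` parameter are hypothetical names, and the condition used is the one from the question's merge filter (start_j <= start_i and finish_j > start_i).

```python
import numpy as np

def active_sum_chunked(start, finish, val, chunk=1024):
    """For each row i, sum val[j] over the rows j that are active at start[i]."""
    out = np.empty(len(start), dtype=val.dtype)
    for i in range(0, len(start), chunk):
        t = start[i:i + chunk, None]        # query times for this block, shape (b, 1)
        mask = (start <= t) & (finish > t)  # shape (b, n): is row j active at time t?
        out[i:i + chunk] = mask @ val       # sum the active values for each query time
    return out

# Same data as the question's df, as plain arrays.
start = np.array([1, 2, 3, 4, 5])
finish = np.array([3, 4, 6, 6, 6])
val = np.array([100, 200, 300, 400, 500])
result = active_sum_chunked(start, finish, val, chunk=2)
print(result)
```

The total work is still O(n*m), but peak memory for the mask drops to roughly chunk * n booleans, so `chunk` can be tuned to whatever fits.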
Some timings
#df=pd.concat([df]*1000)
%timeit merged(df)
1 loop, best of 3: 5.02 s per loop
%timeit npb(df)
1 loop, best of 3: 283 ms per loop
%timeit PIR(df)
100 loops, best of 3: 9.8 ms per loop
def merged(df):
    # Cross join every row with every other row, then keep the active pairs
    df['dummy'] = 1
    df = df.merge(df, on=['dummy'], how='left')
    df = df[(df['start_y'] <= df['start_x']) & (df['finish_y'] > df['start_x'])]
    val = df.groupby('start_x')['val_y'].sum()
    return val

def npb(df):
    s1 = df['start'].values
    s2 = df['finish'].values
    # Row i, column j: is row j active at row i's start? (start_j <= start_i < finish_j)
    return np.sum(((s1 <= s1[:, None]) & (s2 > s1[:, None])) * df.val.values, 1)