New*_*ler 3 python performance pandas numba
Suppose I have the following DataFrame, created like this: df = pd.DataFrame(np.random.randint(0, 10, size=(10, 1)))
0
0 0
1 2
2 8
3 1
4 0
5 0
6 7
7 0
8 2
9 2
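Is there an efficient way to cumsum the rows with a limit, so that each time the limit is reached a new cumsum starts? After each limit is reached (however many rows that takes), a row is created with the accumulated total.
Below is an example function that does this, but it is very slow, especially when the DataFrame becomes very large. What I don't like about my function is that it loops, and I am looking for a way to make it faster (I guess a way without a loop).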
def foo(df, max_value):
    last_value = 0
    storage = []
    for index, row in df.iterrows():
        this_value = np.nansum([row[0], last_value])
        if this_value >= max_value:
            storage.append((index, this_value))
            this_value = 0
        last_value = this_value
    return storage
If you run my function like this: foo(df, 5)
For the data above, it returns:
0
2 10
6 8
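A loop cannot be avoided, but it can be made much faster by compiling it with numba's njit (note that prange without parallel=True behaves like a plain range, so the loop still runs sequentially):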
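from numba import njit, prange

@njit
def dynamic_cumsum(seq, index, max_value):
    cumsum = []
    running = 0
    # Without parallel=True, prange iterates exactly like range.
    for i in prange(len(seq)):
        # The limit is checked before the current value is added, so the
        # recorded label is the row after the one that exceeded the limit.
        if running > max_value:
            cumsum.append([index[i], running])
            running = 0
        running += seq[i]
    # Flush whatever is left over after the last row.
    cumsum.append([index[-1], running])
    return cumsum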
The index is passed in here on the assumption that your index is not numeric or monotonically increasing.
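One alternative to passing the index into the compiled function is to compute integer positions first and translate them to labels afterwards, which keeps the hot loop purely numeric. The following is a minimal plain-Python sketch of that idea, not the answer's method: `reset_positions` and the `labels` list are hypothetical names introduced for illustration, and the reset rule follows the question's `foo` (add first, then test with `>=`).

```python
def reset_positions(values, limit):
    """Return (position, total) pairs; add first, then test with >= as in foo."""
    out = []
    running = 0
    for pos, v in enumerate(values):
        running += v
        if running >= limit:
            out.append((pos, running))
            running = 0
    return out

values = [0, 2, 8, 1, 0, 0, 7, 0, 2, 2]
labels = list("abcdefghij")  # hypothetical non-numeric index labels

# Translate integer positions to labels only after the loop has finished.
labelled = [(labels[pos], total) for pos, total in reset_positions(values, 5)]
print(labelled)  # [('c', 10), ('g', 8)]
```

With a DataFrame, the same translation step could be done by indexing `df.index` with the list of positions.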
%timeit foo(df, 5)
1.24 ms ± 41.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit dynamic_cumsum(df.iloc(axis=1)[0].values, df.index.values, 5)
77.2 µs ± 4.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
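To see why the loop is hard to remove, it may help to compare the sequential rule with a naive loop-free attempt. The sketch below uses only the standard library; `sequential_resets` and `naive_bucket_boundaries` are hypothetical names, and the naive version is the list equivalent of bucketing a plain cumulative sum by `total // limit`. It ignores the fact that each reset discards the overshoot, so its boundaries drift away from the sequential ones.

```python
from itertools import accumulate

def sequential_resets(values, limit):
    """Reference behaviour: each reset depends on where the previous one fell."""
    out = []
    running = 0
    for i, v in enumerate(values):
        running += v
        if running >= limit:
            out.append((i, running))
            running = 0
    return out

def naive_bucket_boundaries(values, limit):
    """Loop-free idea: bucket the global cumulative sum by limit."""
    buckets = [total // limit for total in accumulate(values)]
    previous = [0] + buckets[:-1]
    # A boundary is any row where the bucket number changes.
    return [i for i, (b, p) in enumerate(zip(buckets, previous)) if b != p]

data = [0, 2, 8, 1, 0, 0, 7, 0, 2, 2]
print(sequential_resets(data, 5))        # [(2, 10), (6, 8)]
print(naive_bucket_boundaries(data, 5))  # [2, 6, 8]
```

The naive version reports an extra boundary at row 8 because the overshoot from earlier resets (10 and 8 against a limit of 5) is still counted in the global cumulative sum.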
If the index is of type Int64Index, you can shorten this to:
@njit
def dynamic_cumsum2(seq, max_value):
    cumsum = []
    running = 0
    # Without parallel=True, prange iterates exactly like range.
    for i in prange(len(seq)):
        if running > max_value:
            cumsum.append([i, running])
            running = 0
        running += seq[i]
    # Flush whatever is left over after the last row.
    cumsum.append([i, running])
    return cumsum

lst = dynamic_cumsum2(df.iloc(axis=1)[0].values, 5)
pd.DataFrame(lst, columns=['A', 'B']).set_index('A')
B
A
3 10
7 8
9 4
%timeit foo(df, 5)
1.23 ms ± 30.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit dynamic_cumsum2(df.iloc(axis=1)[0].values, 5)
71.4 µs ± 1.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
njit function performance
import perfplot

perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.randint(0, 10, size=(n, 1))),
    kernels=[
        lambda df: list(cumsum_limit_nb(df.iloc[:, 0].values, 5)),
        lambda df: dynamic_cumsum2(df.iloc[:, 0].values, 5)
    ],
    labels=['cumsum_limit_nb', 'dynamic_cumsum2'],
    n_range=[2**k for k in range(0, 17)],
    xlabel='N',
    logx=True,
    logy=True,
    equality_check=None  # TODO - update when @jpp adds in the final `yield`
)
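The log-log plot shows that the generator function is faster for larger inputs.
A possible explanation is that, as N grows, the overhead of appending to a growing list in dynamic_cumsum2 becomes significant, whereas cumsum_limit_nb only has to yield.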
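If list growth really is the bottleneck (the explanation above is offered only as a possibility), one thing to try is writing results into preallocated buffers and trimming them at the end. The sketch below shows the pattern in plain Python with the standard-library `array` module; `cumsum_limit_prealloc` is a hypothetical name. Under Numba one would presumably allocate NumPy arrays such as `np.empty(n, dtype=np.int64)` instead. That substitution is an assumption and has not been benchmarked here.

```python
from array import array

def cumsum_limit_prealloc(values, limit=5):
    """Same add-then-test rule as cumsum_limit_nb, but with fixed-size buffers."""
    n = len(values)
    # Worst case: every row triggers a reset, so n slots always suffice.
    idx = array('q', [0]) * n
    totals = array('q', [0]) * n
    k = 0       # number of slots filled so far
    count = 0   # running total since the last reset
    for i in range(n):
        count += values[i]
        if count > limit:
            idx[k] = i
            totals[k] = count
            k += 1
            count = 0
    # Trim the unused tail of both buffers.
    return idx[:k], totals[:k]

data = [0, 2, 8, 1, 0, 0, 7, 0, 2, 2]
idx, totals = cumsum_limit_prealloc(data)
print(list(idx), list(totals))  # [2, 6] [10, 8]
```

The trade-off is memory: the buffers are sized for the worst case even when only a few resets occur.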
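Loops are not necessarily bad. The trick is to make sure they run on low-level objects. In this case, you can use Numba or Cython. For example, using a generator with numba.njit: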
from numba import njit

@njit
def cumsum_limit(A, limit=5):
    count = 0
    for i in range(A.shape[0]):
        count += A[i]
        if count > limit:
            yield i, count
            count = 0

idx, vals = zip(*cumsum_limit(df[0].values))
res = pd.Series(vals, index=idx)
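The two answers do not return the same rows for the sample data, because they test the limit at different points in the loop. The plain-Python mirrors below (hypothetical names, no Numba required) make the difference visible: the generator adds the value and then tests, while `dynamic_cumsum2` tests before adding and also emits the leftover total at the end.

```python
def reset_after_add(values, limit):
    """Mirror of the generator: add the value first, then test with >."""
    out, count = [], 0
    for i, v in enumerate(values):
        count += v
        if count > limit:
            out.append((i, count))
            count = 0
    return out

def reset_before_add(values, limit):
    """Mirror of dynamic_cumsum2: test first, then add; flush the remainder."""
    out, running = [], 0
    for i, v in enumerate(values):
        if running > limit:
            out.append((i, running))
            running = 0
        running += v
    out.append((len(values) - 1, running))
    return out

data = [0, 2, 8, 1, 0, 0, 7, 0, 2, 2]
print(reset_after_add(data, 5))   # [(2, 10), (6, 8)]
print(reset_before_add(data, 5))  # [(3, 10), (7, 8), (9, 4)]
```

Only the add-then-test version reproduces the output shown in the question for this data. The question's `foo` compares with `>=` while both answers use `>`, so a running total exactly equal to the limit triggers a reset in `foo` but not in either answer.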
To demonstrate the performance benefit of JIT compilation with Numba:
import pandas as pd, numpy as np
from numba import njit

df = pd.DataFrame({0: [0, 2, 8, 1, 0, 0, 7, 0, 2, 2]})

@njit
def cumsum_limit_nb(A, limit=5):
    count = 0
    for i in range(A.shape[0]):
        count += A[i]
        if count > limit:
            yield i, count
            count = 0

def cumsum_limit(A, limit=5):
    count = 0
    for i in range(A.shape[0]):
        count += A[i]
        if count > limit:
            yield i, count
            count = 0

n = 10**4
df = pd.concat([df]*n, ignore_index=True)

%timeit list(cumsum_limit_nb(df[0].values))  # 4.19 ms ± 90.4 µs per loop
%timeit list(cumsum_limit(df[0].values))     # 58.3 ms ± 194 µs per loop