sla*_*law 5 python arrays performance numpy
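I have a very long NumPy array containing 1_000_000_000 elements, and I want to slide a 50-element window across it and ask whether all elements within the window are finite. If all 50 elements in the window are finite, return True (for that window); otherwise, if one or more elements in the 50-element window are not finite, return False (for that window). Continue until all windows have been evaluated. A good way to do this is: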
import numpy as np

def rolling_window(a, window):
    a = np.asarray(a)
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

if __name__ == "__main__":
    a = np.random.rand(100_000_000)  # This is 10x shorter than my real data
    w = 50
    idx = np.random.randint(0, len(a), size=len(a)//10)  # Simulate having np.nan in my array
    a[idx] = np.nan
    print(np.all(rolling_window(np.isfinite(a), w), axis=1))
However, this is slow when my array has length 1_000_000_000. Is there a faster way to accomplish this that does not require a lot of memory?
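To make the expected output concrete, here is a minimal sketch on a tiny array. It uses `sliding_window_view` (available in NumPy 1.20 and later) in place of the hand-rolled `rolling_window`; that substitution, the seven-element array, and `w = 3` are illustrative assumptions and not part of the original question.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Tiny illustrative input: one NaN and one inf among finite values.
a = np.array([1.0, 2.0, np.nan, 4.0, 5.0, 6.0, np.inf])
w = 3  # window length (50 in the real problem)

# One boolean per window start; there are len(a) - w + 1 windows.
result = sliding_window_view(np.isfinite(a), w).all(axis=1)
print(result)  # only the window [4.0, 5.0, 6.0] is entirely finite
```

The result has `len(a) - w + 1` entries, one per window start, which is the shape every approach below should reproduce.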
Approach #1: Abuse the strided window to assign directly into the isfinite mask -
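def strided_allfinite(a, w):
    m = np.isfinite(a)
    p = rolling_window(m, w)  # writable strided view into m
    # A non-finite value among the first w elements invalidates every earlier window start.
    nmW = ~m[:w]
    if nmW.any():
        m[:np.flatnonzero(nmW).max()] = False
    # A non-finite value at position i + w - 1 invalidates window starts i .. i + w - 1.
    p[~m[w-1:]] = False
    return m[:len(m) - w + 1]  # same as m[:-w+1] for w > 1, but also correct for w == 1

Timings on the given sample data: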
In [323]: N = 100_000_000
     ...: w = 50
     ...:
     ...: np.random.seed(0)
     ...: a = np.random.rand(N) # This is 10x shorter than my real data
     ...: idx = np.random.randint(0, len(a), size=len(a)//10) # Simulate...
     ...: a[idx] = np.nan

# Original soln
In [324]: %timeit np.all(rolling_window(np.isfinite(a), w), axis=1)
1.61 s ± 14.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [325]: %timeit strided_allfinite(a, w)
556 ms ± 87.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Approach #2
We can leverage convolution:
np.convolve(np.isfinite(a), np.ones(w), 'valid') == w

Approach #3
# Imported from scipy.ndimage directly; the scipy.ndimage.morphology namespace is deprecated.
from scipy.ndimage import binary_erosion

m = np.isfinite(a)
out = binary_erosion(m, np.ones(w, dtype=bool))[w//2:len(a)-w+1+w//2]
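Since the question also asks to keep memory use low, one possible variation is to run the convolution test from Approach #2 block by block, so that the floating-point intermediates only ever span one block. The sketch below is an assumption layered on top of the answer, not part of it: the helper name `chunked_allfinite` and the `chunk` parameter are hypothetical, and it has not been benchmarked against the timings above.

```python
import numpy as np

def chunked_allfinite(a, w, chunk=1_000_000):
    """Hypothetical helper: windowed all-finite test, computed one block at a time."""
    a = np.asarray(a)
    n_out = len(a) - w + 1  # number of window start positions
    if n_out <= 0:
        return np.zeros(0, dtype=bool)
    out = np.empty(n_out, dtype=bool)
    kernel = np.ones(w)
    for start in range(0, n_out, chunk):
        stop = min(start + chunk, n_out)
        # Windows starting in [start, stop) read elements up to index stop + w - 2,
        # so each block overlaps the next one by w - 1 elements.
        block = np.isfinite(a[start:stop + w - 1])
        # A window is all finite exactly when its count of finite elements equals w.
        out[start:stop] = np.convolve(block, kernel, "valid") == w
    return out
```

Usage would look like `mask = chunked_allfinite(a, 50)`. The boolean output still needs one byte per window, but the temporary mask and convolution result are limited to roughly `chunk + w` elements at a time; the best `chunk` size is something to tune on the real data.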