pho*_*pho 9 python performance numpy scipy pandas
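I want to filter a numpy array (or pandas DataFrame) so that only consecutive runs of the same value that are at least window_size long are kept, and all other values are set to 0.
For example: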
[1,1,1,0,0,1,1,1,1,0,0,1,0,0,0,1,1,1,0,1,1,1,1]
should become, with a window size of 4,
[0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,1,1,1,1]
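To pin down the requirement, here is a minimal plain-Python reference that walks over the runs one by one. The function name `filter_runs_reference` is made up for illustration, and it assumes (as in the example above) that runs of 0 stay 0. It is meant as a readable specification rather than a fast solution.

```python
def filter_runs_reference(values, window_size):
    """Keep non-zero runs that are at least window_size long; zero the rest."""
    result = [0] * len(values)
    i = 0
    while i < len(values):
        j = i
        # Advance j to one past the end of the run of equal values starting at i.
        while j < len(values) and values[j] == values[i]:
            j += 1
        # Copy the run only if it is non-zero and long enough.
        if values[i] != 0 and j - i >= window_size:
            result[i:j] = values[i:j]
        i = j
    return result


example = [1,1,1,0,0,1,1,1,1,0,0,1,0,0,0,1,1,1,0,1,1,1,1]
print(filter_runs_reference(example, 4))
```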
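I have already tried rolling_apply and scipy.ndimage.filters.generic_filter, but because of the nature of rolling kernel functions I don't think that is the right approach here (and I am stuck at the moment).
Here is my attempt anyway: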
import numpy as np
import pandas as pd
import scipy
#from scipy import ndimage
df= pd.DataFrame({'x':np.array([1,1,1,0,0,1,1,1,1,0,0,1,0,0,0,1,1,1,0,1,1,1,1])})
df_alt = df.copy()

def filter_df(df, colname, window_size):
    rolling_func = lambda z: z.sum() >= window_size
    df[colname] = pd.rolling_apply(df[colname],
                                   window_size,
                                   rolling_func,
                                   min_periods=window_size/2,
                                   center = True)

def filter_alt(df, colname, window_size):
    rolling_func = lambda z: z.sum() >= window_size
    return scipy.ndimage.filters.generic_filter(df[colname].values,
                                                rolling_func,
                                                size = window_size,
                                                origin = 0)

window_size = 4
filter_df(df, 'x', window_size)
print df
filter_alt(df_alt, 'x', window_size)
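This is essentially a binary opening operation (erosion followed by dilation) from image processing, applied to the 1D case. Such operations can be implemented with convolution. Luckily, NumPy supports 1D convolution, so a solution for our case looks like this -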
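def conv_app(A, WSZ):
    K = np.ones(WSZ,dtype=int)   # box kernel of length WSZ
    L = WSZ-1                    # samples added on each side by the two full convolutions
    # Inner convolve: window sums, equal to WSZ only where the whole window is ones.
    # Outer convolve: spreads each such mark back over its window; [L:-L] trims to len(A).
    return (np.convolve(np.convolve(A,K)>=WSZ,K)[L:-L]>0).astype(int)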
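To make the one-liner easier to follow, the sketch below splits it into its three steps on a small array. The input array and the intermediate variable names are chosen only for illustration; the logic is intended to be the same as in conv_app.

```python
import numpy as np

A = np.array([1, 1, 1, 0, 1, 1, 1, 1, 0, 1])
WSZ = 4
K = np.ones(WSZ, dtype=int)

# Step 1: a full convolution with a box kernel gives the sum over every
# length-WSZ window; the sum reaches WSZ only where the whole window is ones.
window_sums = np.convolve(A, K)
all_ones = (window_sums >= WSZ).astype(int)

# Step 2: convolving the marks with the same kernel spreads each mark back
# over the WSZ positions of the window it came from.
spread = np.convolve(all_ones, K)

# Step 3: each full convolution added WSZ-1 samples, so trim WSZ-1 per side.
L = WSZ - 1
result = (spread[L:-L] > 0).astype(int)

print(result)  # [0 0 0 0 1 1 1 1 0 0]
```

Only the run of four ones at positions 4 to 7 survives; the run of three at the start and the single trailing 1 are zeroed.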
Sample run -
In [581]: A
Out[581]: array([1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1])
In [582]: conv_app(A,4)
Out[582]: array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
In [583]: A = np.append(1,A) # Append 1 and see what happens!
In [584]: A
Out[584]: array([1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1])
In [585]: conv_app(A,4)
Out[585]: array([1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
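One edge case worth noting: with WSZ = 1 the trim in conv_app becomes [0:-0], which is an empty slice, so the function returns an empty array. A possible workaround, sketched here under the hypothetical name `conv_app_safe`, is to slice with an explicit end index instead of a negative one.

```python
import numpy as np

def conv_app_safe(A, WSZ):
    # Hypothetical variant of conv_app; the name and the explicit end index
    # are assumptions and not part of the original answer.
    A = np.asarray(A)
    K = np.ones(WSZ, dtype=int)
    L = WSZ - 1
    marks = (np.convolve(A, K) >= WSZ).astype(int)
    # Explicit end index: for WSZ == 1 this is [0:len(A)] rather than [0:-0].
    return (np.convolve(marks, K)[L:L + len(A)] > 0).astype(int)


A = np.array([1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1])
print(conv_app_safe(A, 4))
print(conv_app_safe([1, 0, 1, 1], 1))
```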
Runtime tests -
This section benchmarks the other approaches posted for this question. Their definitions are listed below -
import itertools

def groupby_app(A,WSZ): # @lambo477's solution
    groups = itertools.groupby(A)
    result = []
    for group in groups:
        group_items = [item for item in group[1]]
        group_length = len(group_items)
        if group_length >= WSZ:
            result.extend([item for item in group_items])
        else:
            result.extend([0]*group_length)
    return result

def stride_tricks_app(arr, window): # @ajcr's solution
    x = pd.rolling_min(arr, window)
    x[:window-1] = 0
    y = np.lib.stride_tricks.as_strided(x, (len(x)-window+1, window), (8, 8))
    y[y[:, -1] == 1] = 1
    return x.astype(int)
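A side note for readers on current pandas: the top-level pd.rolling_min and pd.rolling_apply functions are no longer available in recent releases. The sketch below is one possible rewrite of the rolling-minimum idea using the Series.rolling API. The name `rolling_app` is hypothetical, the second step uses a reversed rolling maximum instead of stride tricks, and it is not one of the functions timed in this post.

```python
import numpy as np
import pandas as pd

def rolling_app(values, window):
    # Hypothetical helper for illustration; not from the original answers.
    s = pd.Series(values)
    # 1 at the last position of every window that consists only of ones;
    # the first window-1 positions have no full window and become 0.
    ends = s.rolling(window).min().fillna(0)
    # Reverse, take a trailing rolling max, reverse back: a position becomes 1
    # if it or any of the following window-1 positions ends an all-ones window.
    spread = ends.iloc[::-1].rolling(window, min_periods=1).max().iloc[::-1]
    return spread.astype(int).to_numpy()


x = np.array([1,1,1,0,0,1,1,1,1,0,0,1,0,0,0,1,1,1,0,1,1,1,1])
print(rolling_app(x, 4))
```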
Timings -
In [541]: A = np.random.randint(0,2,(100000))
In [542]: WSZ = 4
In [543]: %timeit groupby_app(A,WSZ)
10 loops, best of 3: 74.5 ms per loop
In [544]: %timeit stride_tricks_app(A,WSZ)
100 loops, best of 3: 3.35 ms per loop
In [545]: %timeit conv_app(A,WSZ)
100 loops, best of 3: 2.82 ms per loop
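The numbers above come from IPython's %timeit. To reproduce a comparable measurement in a plain script, something along these lines should work with the standard timeit module. The seed, the repeat counts and the use of np.random.default_rng are arbitrary choices for this sketch, and absolute timings will differ by machine and library version.

```python
import timeit
import numpy as np

def conv_app(A, WSZ):
    # Same convolution approach as in the answer above.
    K = np.ones(WSZ, dtype=int)
    L = WSZ - 1
    return (np.convolve(np.convolve(A, K) >= WSZ, K)[L:-L] > 0).astype(int)


rng = np.random.default_rng(0)      # fixed seed so runs are repeatable
A = rng.integers(0, 2, 100000)      # random 0/1 array, as in the timings
WSZ = 4

number = 10                         # calls per measurement
runs = timeit.repeat(lambda: conv_app(A, WSZ), number=number, repeat=3)
best = min(runs) / number           # best average time per call, in seconds
print("conv_app: %.3f ms per loop" % (best * 1e3))
```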