jez*_*ael 7 python arrays counter numpy cumulative-sum
我有一般数据,例如字符串:
np.random.seed(343)
arr = np.sort(np.random.randint(5, size=(10, 10)), axis=1).astype(str)
print (arr)
[['0' '1' '1' '2' '2' '3' '3' '4' '4' '4']
['1' '2' '2' '2' '3' '3' '3' '4' '4' '4']
['0' '2' '2' '2' '2' '3' '3' '4' '4' '4']
['0' '1' '2' '2' '3' '3' '3' '4' '4' '4']
['0' '1' '1' '1' '2' '2' '2' '2' '4' '4']
['0' '0' '1' '1' '2' '3' '3' '3' '4' '4']
['0' '0' '2' '2' '2' '2' '2' '2' '3' '4']
['0' '0' '1' '1' '1' '2' '2' '2' '3' '3']
['0' '1' '1' '2' '2' '2' '3' '4' '4' '4']
['0' '1' '1' '2' '2' '2' '2' '2' '4' '4']]
Run Code Online (Sandbox Code Playgroud)
如果累计值的计数器有差异,我需要重置计数,所以使用pandas.
首先创建DataFrame:
df = pd.DataFrame(arr)
print (df)
0 1 2 3 4 5 6 7 8 9
0 0 1 1 2 2 3 3 4 4 4
1 1 2 2 2 3 3 3 4 4 4
2 0 2 2 2 2 3 3 4 4 4
3 0 1 2 2 3 3 3 4 4 4
4 0 1 1 1 2 2 2 2 4 4
5 0 0 1 1 2 3 3 3 4 4
6 0 0 2 2 2 2 2 2 3 4
7 0 0 1 1 1 2 2 2 3 3
8 0 1 1 2 2 2 3 4 4 4
9 0 1 1 2 2 2 2 2 4 4
Run Code Online (Sandbox Code Playgroud)
它如何适用于一列:
首先比较移位数据并添加累积总和:
a = (df[0] != df[0].shift()).cumsum()
print (a)
0 1
1 2
2 3
3 3
4 3
5 3
6 3
7 3
8 3
9 3
Name: 0, dtype: int32
Run Code Online (Sandbox Code Playgroud)
然后打电话GroupBy.cumcount:
b = a.groupby(a).cumcount() + 1
print (b)
0 1
1 1
2 1
3 2
4 3
5 4
6 5
7 6
8 7
9 8
dtype: int64
Run Code Online (Sandbox Code Playgroud)
如果想对所有列应用解决方案可以使用apply:
print (df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumcount() + 1))
0 1 2 3 4 5 6 7 8 9
0 1 1 1 1 1 1 1 1 1 1
1 1 1 1 2 1 2 2 2 2 2
2 1 2 2 3 1 3 3 3 3 3
3 2 1 3 4 1 4 4 4 4 4
4 3 2 1 1 1 1 1 1 5 5
5 4 1 2 2 2 1 1 1 6 6
6 5 2 1 1 3 1 1 1 1 7
7 6 3 1 1 1 2 2 2 2 1
8 7 1 2 1 1 3 1 1 1 1
9 8 2 3 2 2 4 1 1 2 2
Run Code Online (Sandbox Code Playgroud)
但它很慢,因为大数据.有可能创建一些快速的numpy解决方案吗?
我发现只适用于1d阵列的解决方案.
考虑我们执行此累积计数的一般情况,或者如果您将它们视为范围,我们可以将它们称为 - 分组范围.
现在,这个想法很简单 - 比较沿着相应轴的一次性切片以寻找不等式.True在每行/列的开头填充(取决于计数轴).
然后,它变得复杂 - 设置一个ID数组,意图是我们将最终的cumsum,它将是其扁平化顺序的期望输出.因此,设置从初始化1s与输入数组具有相同形状的数组开始.在每个组开始输入时,使用先前的组长度偏移ID数组.关注我们如何为每一行做代码(应该提供更多见解) -
def grp_range_2drow(a, start=0):
# Get grouped ranges along each row with resetting at places where
# consecutive elements differ
# Input(s) : a is 2D input array
# Store shape info
m,n = a.shape
# Compare one-off slices for each row and pad with True's at starts
# Those True's indicate start of each group
p = np.ones((m,1),dtype=bool)
a1 = np.concatenate((p, a[:,:-1] != a[:,1:]),axis=1)
# Get indices of group starts in flattened version
d = np.flatnonzero(a1)
# Setup ID array to be cumsumed finally for desired o/p
# Assign into starts with previous group lengths.
# Thus, when cumsumed on flattened version would give us flattened desired
# output. Finally reshape back to 2D
c = np.ones(m*n,dtype=int)
c[d[1:]] = d[:-1]-d[1:]+1
c[0] = start
return c.cumsum().reshape(m,n)
Run Code Online (Sandbox Code Playgroud)
我们将扩展它以解决行和列的一般情况.对于列情况,我们只需转置,提供给早期的行解决方案,最后转置回来,就像这样 -
def grp_range_2d(a, start=0, axis=1):
# Get grouped ranges along specified axis with resetting at places where
# consecutive elements differ
# Input(s) : a is 2D input array
if axis not in [0,1]:
raise Exception("Invalid axis")
if axis==1:
return grp_range_2drow(a, start=start)
else:
return grp_range_2drow(a.T, start=start).T
Run Code Online (Sandbox Code Playgroud)
样品运行
让我们考虑一个样本运行,因为它会在每列中找到分组范围,每个组以1- 开头-
In [330]: np.random.seed(0)
In [331]: a = np.random.randint(1,3,(10,10))
In [333]: a
Out[333]:
array([[1, 2, 2, 1, 2, 2, 2, 2, 2, 2],
[2, 1, 1, 2, 1, 1, 1, 1, 1, 2],
[1, 2, 2, 1, 1, 2, 2, 2, 2, 1],
[2, 1, 2, 1, 2, 2, 1, 2, 2, 1],
[1, 2, 1, 2, 2, 2, 2, 2, 1, 2],
[1, 2, 2, 2, 2, 1, 2, 1, 1, 2],
[2, 1, 2, 1, 2, 1, 1, 1, 1, 1],
[2, 2, 1, 1, 1, 2, 2, 1, 2, 1],
[1, 2, 1, 2, 2, 2, 2, 2, 2, 1],
[2, 2, 1, 1, 2, 1, 1, 2, 2, 1]])
In [334]: grp_range_2d(a, start=1, axis=0)
Out[334]:
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 2],
[1, 1, 1, 1, 2, 1, 1, 1, 1, 1],
[1, 1, 2, 2, 1, 2, 1, 2, 2, 2],
[1, 1, 1, 1, 2, 3, 1, 3, 1, 1],
[2, 2, 1, 2, 3, 1, 2, 1, 2, 2],
[1, 1, 2, 1, 4, 2, 1, 2, 3, 1],
[2, 1, 1, 2, 1, 1, 1, 3, 1, 2],
[1, 2, 2, 1, 1, 2, 2, 1, 2, 3],
[1, 3, 3, 1, 2, 1, 1, 2, 3, 4]])
Run Code Online (Sandbox Code Playgroud)
因此,为了解决数据帧输入和输出的问题,它将是 -
out = grp_range_2d(df.values, start=1,axis=0)
pd.DataFrame(out,columns=df.columns,index=df.index)
Run Code Online (Sandbox Code Playgroud)
和numba解决方案.对于这样棘手的问题,它总是胜出,这里是7x因子vs numpy,因为只有一次传递res.
from numba import njit
@njit
def thefunc(arrc):
m,n=arrc.shape
res=np.empty((m+1,n),np.uint32)
res[0]=1
for i in range(1,m+1):
for j in range(n):
if arrc[i-1,j]:
res[i,j]=res[i-1,j]+1
else : res[i,j]=1
return res
def numbering(arr):return thefunc(arr[1:]==arr[:-1])
Run Code Online (Sandbox Code Playgroud)
我需要外化,arr[1:]==arr[:-1]因为numba不支持字符串.
In [75]: %timeit numbering(arr)
13.7 µs ± 373 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [76]: %timeit grp_range_2dcol(arr)
111 µs ± 18.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Run Code Online (Sandbox Code Playgroud)
对于更大的阵列(100 000行x 100 cols),差距不是很大:
In [168]: %timeit a=grp_range_2dcol(arr)
1.54 s ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [169]: %timeit a=numbering(arr)
625 ms ± 43.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Run Code Online (Sandbox Code Playgroud)
如果arr可以转换为'S8',我们可以赢得很多时间:
In [398]: %timeit arr[1:]==arr[:-1]
584 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [399]: %timeit arr.view(np.uint64)[1:]==arr.view(np.uint64)[:-1]
196 ms ± 18.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Run Code Online (Sandbox Code Playgroud)