cri*_*nix 8 python numpy count dataframe pandas
I have a dataset made up of True and False values.
Sample Table:
A B C
0 False True False
1 False False False
2 True True False
3 True True True
4 False True False
5 True True True
6 True False False
7 True False True
8 False True True
9 True False False
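I want to count the number of consecutive True values in each column, and if a column has more than one run of consecutive True values, I want the length of the longest one.
For the table above, I would get: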
length = [3, 4, 2]
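I found similar threads, but none of them solved my problem.
Since there will be many more columns (products), I need to do this for the whole table regardless of the column names, and get an array as the result.
If possible, I would also like to get the index of the first True of the longest sequence, that is, where this longest run of True values starts, so the result would be: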
index = [5, 2, 7]
The solution can be simplified if every column always contains at least one True:
b = df.cumsum()
c = b.sub(b.mask(df).ffill().fillna(0)).astype(int)
print (c)
A B C
0 0 1 0
1 0 0 0
2 1 1 0
3 2 2 1
4 0 3 0
5 1 4 1
6 2 0 0
7 3 0 1
8 0 1 2
9 1 0 0
#get maximal value of all columns
length = c.max().tolist()
print (length)
[3, 4, 2]
#get indexes by maximal value, subtract length and add 1
index = c.idxmax().sub(length).add(1).tolist()
print (index)
[5, 2, 7]
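To run the steps above end to end, here is a minimal self-contained sketch. The construction of `df` is an assumption, rebuilt by hand from the printed sample table in the question. It also assumes the default integer index, because the `idxmax` arithmetic subtracts run lengths from index labels.

```python
import pandas as pd

# Sample table from the question, rebuilt by hand (assumed construction).
df = pd.DataFrame({
    "A": [False, False, True, True, False, True, True, True, False, True],
    "B": [True, False, True, True, True, True, False, False, True, False],
    "C": [False, False, False, True, False, True, False, True, True, False],
})

# Running count of True values per column.
b = df.cumsum()
# Subtract the count reached at the most recent False, so the counter
# restarts from zero after every False.
c = b.sub(b.mask(df).ffill().fillna(0)).astype(int)

# Longest run per column.
length = c.max().tolist()
# idxmax gives the row where the first longest run ends; step back to its start.
index = c.idxmax().sub(length).add(1).tolist()

print(length)
print(index)
```

With the sample table this should reproduce the `[3, 4, 2]` and `[5, 2, 7]` shown above.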
Details:
print (pd.concat([b,
                  b.mask(df),
                  b.mask(df).ffill(),
                  b.mask(df).ffill().fillna(0),
                  b.sub(b.mask(df).ffill().fillna(0)).astype(int)
                  ], axis=1,
                 keys=('cumsum', 'mask', 'ffill', 'fillna','sub')))
  cumsum    mask           ffill          fillna         sub
   A  B  C    A    B    C    A    B    C    A    B    C  A  B  C
0  0  1  0  0.0  NaN  0.0  0.0  NaN  0.0  0.0  0.0  0.0  0  1  0
1  0  1  0  0.0  1.0  0.0  0.0  1.0  0.0  0.0  1.0  0.0  0  0  0
2  1  2  0  NaN  NaN  0.0  0.0  1.0  0.0  0.0  1.0  0.0  1  1  0
3  2  3  1  NaN  NaN  NaN  0.0  1.0  0.0  0.0  1.0  0.0  2  2  1
4  2  4  1  2.0  NaN  1.0  2.0  1.0  1.0  2.0  1.0  1.0  0  3  0
5  3  5  2  NaN  NaN  NaN  2.0  1.0  1.0  2.0  1.0  1.0  1  4  1
6  4  5  2  NaN  5.0  2.0  2.0  5.0  2.0  2.0  5.0  2.0  2  0  0
7  5  5  3  NaN  5.0  NaN  2.0  5.0  2.0  2.0  5.0  2.0  3  0  1
8  5  6  4  5.0  NaN  NaN  5.0  5.0  2.0  5.0  5.0  2.0  0  1  2
9  6  6  4  NaN  6.0  4.0  5.0  6.0  4.0  5.0  6.0  4.0  1  0  0
EDIT:
A general solution that also works when a column contains only False values - add numpy.where with a boolean mask created by DataFrame.any:
print (df)
A B C
0 False True False
1 False False False
2 True True False
3 True True False
4 False True False
5 True True False
6 True False False
7 True False False
8 False True False
9 True False False
b = df.cumsum()
c = b.sub(b.mask(df).ffill().fillna(0)).astype(int)
mask = df.any()
length = np.where(mask, c.max(), -1).tolist()
print (length)
[3, 4, -1]
index = np.where(mask, c.idxmax().sub(c.max()).add(1), 0).tolist()
print (index)
[5, 2, 0]
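The general version above can also be wrapped in a small function. This is only a sketch: the helper name `longest_true_run` and its `fill_len` / `fill_index` parameters are hypothetical, not part of the original answer. Unlike the `idxmax` arithmetic, it returns row positions rather than index labels (an assumption about what is wanted), so it does not depend on a default integer index.

```python
import numpy as np
import pandas as pd

def longest_true_run(df, fill_len=-1, fill_index=0):
    """Hypothetical helper: per-column length and start position of the longest True run."""
    b = df.cumsum()
    # Counter of consecutive True values that resets after every False.
    c = b.sub(b.mask(df).ffill().fillna(0)).astype(int)

    has_true = df.any().to_numpy()
    run_len = c.max().to_numpy()
    # argmax returns the position of the first maximum, i.e. the last row
    # of the first longest run in each column.
    end_pos = c.to_numpy().argmax(axis=0)
    start_pos = end_pos - run_len + 1

    # Columns with no True at all get the fill values instead.
    length = np.where(has_true, run_len, fill_len).tolist()
    index = np.where(has_true, start_pos, fill_index).tolist()
    return length, index

# Same data as in the EDIT (column C is all False), but with string labels
# to show that positions are returned regardless of the index.
df = pd.DataFrame(
    {
        "A": [False, False, True, True, False, True, True, True, False, True],
        "B": [True, False, True, True, True, True, False, False, True, False],
        "C": [False] * 10,
    },
    index=list("abcdefghij"),
)
length, index = longest_true_run(df)
print(length)
print(index)
```

The fill values default to the same `-1` and `0` used in the EDIT. A start of `0` cannot be told apart from a real run starting in the first row, so a different `fill_index` may be a safer choice.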
We would basically rely on two ideas - catching shifts on a compared array, and offsetting each column's results so that we can vectorize the whole thing.
So, with that in mind, here is one way to get the desired results -
import numpy as np

def maxisland_start_len_mask(a, fillna_index = -1, fillna_len = 0):
    # a is a 2D boolean array; islands (runs of True) are found per column

    # Pad a row of False above and below so every island has a start and an end
    pad = np.zeros(a.shape[1],dtype=bool)
    mask = np.vstack((pad, a, pad))

    # Flag value changes between consecutive rows, then flatten column by column
    mask_step = mask[1:] != mask[:-1]
    idx = np.flatnonzero(mask_step.T)
    island_starts = idx[::2]
    island_lens = idx[1::2] - idx[::2]
    n_islands_percol = mask_step.sum(0)//2

    # Offset island lengths by column id so one argsort sorts within each column
    bins = np.repeat(np.arange(a.shape[1]),n_islands_percol)
    scale = island_lens.max()+1

    scaled_idx = np.argsort(scale*bins + island_lens)
    grp_shift_idx = np.r_[0,n_islands_percol.cumsum()]
    max_island_starts = island_starts[scaled_idx[grp_shift_idx[1:]-1]]

    # Convert flattened positions back to row numbers
    max_island_percol_start = max_island_starts%(a.shape[0]+1)

    # Columns without any True keep the fill values
    valid = n_islands_percol!=0
    cut_idx = grp_shift_idx[:-1][valid]
    max_island_percol_len = np.maximum.reduceat(island_lens, cut_idx)

    out_len = np.full(a.shape[1], fillna_len, dtype=int)
    out_len[valid] = max_island_percol_len
    out_index = np.where(valid,max_island_percol_start,fillna_index)
    return out_index, out_len
Sample run -
# Generic case to handle all 0s columns
In [112]: a
Out[112]:
array([[False, False, False],
       [False, False, False],
       [ True, False, False],
       [ True, False,  True],
       [False, False, False],
       [ True, False,  True],
       [ True, False, False],
       [ True, False,  True],
       [False, False,  True],
       [ True, False, False]])
In [117]: starts,lens = maxisland_start_len_mask(a, fillna_index=-1, fillna_len=0)
In [118]: starts
Out[118]: array([ 5, -1, 7])
In [119]: lens
Out[119]: array([3, 0, 2])
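The "catching shifts" idea is easier to follow on a single column. The sketch below is an illustration only, not part of the function above. It uses the first column of the sample array and assumed variable names (`col`, `padded`, `steps`, `edges`). The 2D function applies the same step to the transposed step mask, where each column takes up `a.shape[0]+1` flattened slots, which is why it recovers row numbers with a modulo.

```python
import numpy as np

# First column of the sample array above.
col = np.array([False, False, True, True, False, True, True, True, False, True])

# Pad with False on both ends so every run of True has a start and an end.
padded = np.concatenate(([False], col, [False]))

# True wherever the value differs from the previous one.
steps = padded[1:] != padded[:-1]

# Changes alternate between "run starts here" and "run ended just before here".
edges = np.flatnonzero(steps)
starts = edges[::2]
ends = edges[1::2]          # exclusive end positions
lens = ends - starts

# Pick the longest run (argmax returns the first one on ties).
best = lens.argmax()
print(starts[best], lens[best])
```

For this column the runs start at rows 2, 5 and 9 with lengths 2, 3 and 1, so the longest run starts at row 5 with length 3, matching the first entries of `starts` and `lens` in the sample run.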