如何计算DataFrame中连续TRUE的数量？

Question

如何计算DataFrame中连续TRUE的数量？

cri*_*nix 8 python numpy count dataframe pandas

我有一个由True和False组成的数据集.

Sample Table:
       A      B      C
0  False   True  False
1  False  False  False
2   True   True  False
3   True   True   True
4  False   True  False
5   True   True   True
6   True  False  False
7   True  False   True
8  False   True   True
9   True  False  False

Run Code Online (Sandbox Code Playgroud)

我想计算每列的连续True值的数量,如果有多个连续的True系列,我想得到它的最大值.

对于上表,我会得到:

length = [3, 4, 2]

Run Code Online (Sandbox Code Playgroud)

我找到了类似的线程,但没有解决我的问题.

由于我这样做并且将会有更多的列(产品),因此无论列名如何,我都需要对整个表执行此操作,并获得一个数组作为结果.

如果可能的话,我想学习最长序列的第一个真的索引,也就是这个最长的真系列开始的地方,所以结果将是这个:

index = [5, 2, 7]

Run Code Online (Sandbox Code Playgroud)

Answer 1

jez*_*ael 8

解决方案应该简化,如果True每列至少一个:

b = df.cumsum()
c = b.sub(b.mask(df).ffill().fillna(0)).astype(int)

print (c)
   A  B  C
0  0  1  0
1  0  0  0
2  1  1  0
3  2  2  1
4  0  3  0
5  1  4  1
6  2  0  0
7  3  0  1
8  0  1  2
9  1  0  0

#get maximal value of all columns
length = c.max().tolist()
print (length)
[3, 4, 2]

#get indexes by maximal value, subtract length and add 1 
index = c.idxmax().sub(length).add(1).tolist()
print (index)
[5, 2, 7]

Run Code Online (Sandbox Code Playgroud)

细节:

print (pd.concat([b,
                  b.mask(df), 
                  b.mask(df).ffill(), 
                  b.mask(df).ffill().fillna(0),
                  b.sub(b.mask(df).ffill().fillna(0)).astype(int)
                  ], axis=1, 
                  keys=('cumsum', 'mask', 'ffill', 'fillna','sub')))

  cumsum       mask           ffill           fillna           sub      
       A  B  C    A    B    C     A    B    C      A    B    C   A  B  C
0      0  1  0  0.0  NaN  0.0   0.0  NaN  0.0    0.0  0.0  0.0   0  1  0
1      0  1  0  0.0  1.0  0.0   0.0  1.0  0.0    0.0  1.0  0.0   0  0  0
2      1  2  0  NaN  NaN  0.0   0.0  1.0  0.0    0.0  1.0  0.0   1  1  0
3      2  3  1  NaN  NaN  NaN   0.0  1.0  0.0    0.0  1.0  0.0   2  2  1
4      2  4  1  2.0  NaN  1.0   2.0  1.0  1.0    2.0  1.0  1.0   0  3  0
5      3  5  2  NaN  NaN  NaN   2.0  1.0  1.0    2.0  1.0  1.0   1  4  1
6      4  5  2  NaN  5.0  2.0   2.0  5.0  2.0    2.0  5.0  2.0   2  0  0
7      5  5  3  NaN  5.0  NaN   2.0  5.0  2.0    2.0  5.0  2.0   3  0  1
8      5  6  4  5.0  NaN  NaN   5.0  5.0  2.0    5.0  5.0  2.0   0  1  2
9      6  6  4  NaN  6.0  4.0   5.0  6.0  4.0    5.0  6.0  4.0   1  0  0

Run Code Online (Sandbox Code Playgroud)

编辑:

仅使用False列的常规解决方案- numpy.where使用以下创建的布尔掩码添加DataFrame.any:

print (df)
       A      B      C
0  False   True  False
1  False  False  False
2   True   True  False
3   True   True  False
4  False   True  False
5   True   True  False
6   True  False  False
7   True  False  False
8  False   True  False
9   True  False  False

b = df.cumsum()
c = b.sub(b.mask(df).ffill().fillna(0)).astype(int)

mask = df.any()
length = np.where(mask, c.max(), -1).tolist()
print (length)
[3, 4, -1]

index =  np.where(mask, c.idxmax().sub(c.max()).add(1), 0).tolist()
print (index)
[5, 2, 0]

Run Code Online (Sandbox Code Playgroud)

Answer 2

Div*_*kar 5

我们基本上会利用两种哲学 - Catching shifts on compared array和Offsetting each column results so that we could vectorize it.

因此,有了这个意图,这是实现预期结果的一种方法 -

def maxisland_start_len_mask(a, fillna_index = -1, fillna_len = 0):
    # a is a boolean array

    pad = np.zeros(a.shape[1],dtype=bool)
    mask = np.vstack((pad, a, pad))

    mask_step = mask[1:] != mask[:-1]
    idx = np.flatnonzero(mask_step.T)
    island_starts = idx[::2]
    island_lens = idx[1::2] - idx[::2]
    n_islands_percol = mask_step.sum(0)//2

    bins = np.repeat(np.arange(a.shape[1]),n_islands_percol)
    scale = island_lens.max()+1

    scaled_idx = np.argsort(scale*bins + island_lens)
    grp_shift_idx = np.r_[0,n_islands_percol.cumsum()]
    max_island_starts = island_starts[scaled_idx[grp_shift_idx[1:]-1]]

    max_island_percol_start = max_island_starts%(a.shape[0]+1)

    valid = n_islands_percol!=0
    cut_idx = grp_shift_idx[:-1][valid]
    max_island_percol_len = np.maximum.reduceat(island_lens, cut_idx)

    out_len = np.full(a.shape[1], fillna_len, dtype=int)
    out_len[valid] = max_island_percol_len
    out_index = np.where(valid,max_island_percol_start,fillna_index)
    return out_index, out_len

Run Code Online (Sandbox Code Playgroud)

样品运行 -

# Generic case to handle all 0s columns
In [112]: a
Out[112]: 
array([[False, False, False],
       [False, False, False],
       [ True, False, False],
       [ True, False,  True],
       [False, False, False],
       [ True, False,  True],
       [ True, False, False],
       [ True, False,  True],
       [False, False,  True],
       [ True, False, False]])

In [117]: starts,lens = maxisland_start_len_mask(a, fillna_index=-1, fillna_len=0)

In [118]: starts
Out[118]: array([ 5, -1,  7])

In [119]: lens
Out[119]: array([3, 0, 2])

Run Code Online (Sandbox Code Playgroud)

归档时间：	7 年，4 月前
查看次数：	1079 次
最近记录：	7 年，4 月前