Length of the first sequence of zeros of a given size after a specific column in a pandas DataFrame

Question

Suppose I have a DataFrame like this:

        ID      0   1   2   3   4   5   6   7   8   ... 81  82  83  84  85  86  87  88  89  90  total  day_90
-------------------------------------------------------------------------------------------------------------
0       A       2   21  0   18  3   0   0   0   2   ... 0   0   0   0   0   0   0   0   0   0    156   47
1       B       0   20  12  2   0   8   14  23  0   ... 0   0   0   0   0   0   0   0   0   0    231   35
2       C       0   38  19  3   1   3   3   7   1   ... 0   0   0   0   0   0   0   0   0   0     78   16
3       D       3   0   0   1   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0      5   3

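The last column, [day_90], holds the column ([0] - [90]) at which each row has accumulated 90% of its [total]. To clarify, take the first row: at column 47, ID A has reached 90% of the 156 events it completed over the 90 days.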

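What I need is this: for each row, compute the length of the first sequence of zeros that is longer than 7 (or any other predefined number). For example, for the first row I want to know how long the first sequence of zeros after column 47 is, but only if that sequence has more than 7 consecutive zeros. If there are 6 zeros followed by a non-zero value, I don't want to count it.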

Finally, I want to store this result in a new column after [day_90]. So if ID A has a sequence of 10 zeros after column 47, I want to add a new column [0_sequence] holding the value 10 for that ID.

I really don't know where to start. Any help is appreciated =)

Answer 1

Cod*_*ent 5

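Your problem is basically a variant of the gaps-and-islands problem: a non-zero value starts a new "island", while a 0 extends the current island. You want to find the first island of a certain size. Before answering your question, let me walk you through a simplified version of the problem.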

Suppose you have a Series:

>>> a = pd.Series([0,0,0,13,0,0,4,12,0,0])
0     0
1     0
2     0
3    13
4     0
5     0
6     4
7    12
8     0
9     0

You want to find the length of the first sequence of zeros that is at least 3 elements long. Let's first assign the elements to "islands":

# Every time the number is non-zero, a new "island" is created
>>> b = (a != 0).cumsum()
0    0  <-- island 0
1    0
2    0
3    1  <-- island 1
4    1
5    1
6    2  <-- island 2
7    3  <-- island 3
8    3
9    3

For each island, we are only interested in the elements that are equal to 0:

>>> c = b[a == 0]
0    0
1    0
2    0
4    1
5    1
8    3
9    3

Now let's determine the size of each island:

>>> d = c.groupby(c).count()
0    3  <-- island 0 is of size 3
1    2  <-- island 1 is of size 2
3    2  <-- island 3 is of size 2
dtype: int64

And filter for islands of size >= 3:

>>> e = d[d >= 3]
0    3

If `e` is not empty, the answer is its first element (island 0, size 3). Otherwise, no island meets our criteria.
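The steps `a` through `e` above can be collected into one small helper. This is only a sketch: the name `first_zero_run` is made up for illustration, and returning 0 when no island qualifies is an assumption chosen to match the convention used later in this answer.

```python
import pandas as pd


def first_zero_run(a: pd.Series, n: int) -> int:
    """Length of the first run of zeros in `a` with at least `n` elements, or 0."""
    b = (a != 0).cumsum()       # island label for every element
    c = b[a == 0]               # keep only the zero elements
    d = c.groupby(c).count()    # number of zeros per island
    e = d[d >= n]               # islands that are long enough
    return int(e.iloc[0]) if len(e) else 0


a = pd.Series([0, 0, 0, 13, 0, 0, 4, 12, 0, 0])
print(first_zero_run(a, 3))  # 3
print(first_zero_run(a, 2))  # 3 (the first qualifying island, not the shortest)
print(first_zero_run(a, 4))  # 0 (no island is that long)
```

Because `groupby` sorts the island labels and the labels grow with position, `e.iloc[0]` is always the earliest qualifying island.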

First attempt

Now applying this to your problem:

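import numpy as np

def count_sequence_length(row, n):
    """Return the length of the first sequence of 0s after the
    column given in `day_90` whose length is >= n (0 if none).
    """
    # Fewer than n columns remain after `day_90`: no run can qualify
    if row['day_90'] + n > 90:
        return 0

    # Labels of the columns after `day_90` (assumes integer labels 0..90)
    idx = np.arange(row['day_90'] + 1, 91)

    a = row.loc[idx]            # values after `day_90`
    b = (a != 0).cumsum()       # island label for each element
    c = b[a == 0]               # keep only the zeros
    d = c.groupby(c).count()    # size of each island
    e = d[d >= n]               # islands with at least n zeros

    return 0 if len(e) == 0 else e.iloc[0]

df['0_sequence'] = df.apply(count_sequence_length, n=7, axis=1)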

Second attempt

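The version above works, but it is slow because it computes the size of every island. Since you only care about the size of the first island that meets the condition, a simple for loop runs much faster: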

def count_sequence_length_2(row, n):
    if row['day_90'] + n > 90:
        return 0
    
    size = 0
    for i in range(row['day_90']+1, 91):
        if row[i] == 0:
            # increase the size of the current island
            size += 1
        elif size >= n:
            # found the island we want. Search no more
            break
        else:
            # create a new island
            size = 0
    return size if size >= n else 0

df['0_sequence'] = df.apply(count_sequence_length_2, n=7, axis=1)

When I benchmarked it, this gave a 10-20x speedup.
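To try the loop on something small, here is a self-contained sketch that runs the same early-exit logic over a plain NumPy array instead of going through `df.apply`. It is a variation for experimenting, not the method benchmarked above, and I have not timed it. The helper name `first_zero_run_after`, the 12-column toy frame, and its values are all made up for illustration.

```python
import numpy as np
import pandas as pd


def first_zero_run_after(values: np.ndarray, start: int, n: int) -> int:
    """Length of the first run of >= n zeros in values[start + 1:], or 0."""
    size = 0
    for v in values[start + 1:]:
        if v == 0:
            size += 1      # extend the current run of zeros
        elif size >= n:
            break          # the run that just ended is long enough
        else:
            size = 0       # run too short: start over
    return size if size >= n else 0


# Toy data: 12 day columns (0..11) instead of 91, with made-up values.
days = list(range(12))
df = pd.DataFrame(
    [[5, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0],
     [1, 2, 0, 0, 3, 0, 0, 0, 0, 0, 4, 0],
     [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]],
    columns=days,
)
df.insert(0, "ID", ["A", "B", "C"])
df["day_90"] = [0, 1, 3]  # hypothetical cut-off column per row

counts = df[days].to_numpy()  # only the day columns, as a 2-D integer array
df["0_sequence"] = [
    first_zero_run_after(row, start, n=3)
    for row, start in zip(counts, df["day_90"])
]
print(df["0_sequence"].tolist())  # [3, 5, 0]
```

Row A's first run of two zeros is too short, so the run of three that follows is reported. Row C only has two trailing zeros, so it gets 0.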
