How to count consecutive ordered values in a pandas DataFrame

Question

I am trying to get the maximum count of consecutive 0 values from a given pandas DataFrame with the columns id, date, and value, as shown below:

id    date       value
354   2019-03-01 0
354   2019-03-02 0
354   2019-03-03 0
354   2019-03-04 5
354   2019-03-05 5 
354   2019-03-09 7
354   2019-03-10 0
357   2019-03-01 5
357   2019-03-02 5
357   2019-03-03 8
357   2019-03-04 0
357   2019-03-05 0
357   2019-03-06 7
357   2019-03-07 7
540   2019-03-02 7
540   2019-03-03 8
540   2019-03-04 9
540   2019-03-05 8
540   2019-03-06 7
540   2019-03-07 5
540   2019-03-08 2 
540   2019-03-09 3
540   2019-03-10 2

The desired result, grouped by id, would look like this:

id   max_consecutive_zeros
354  3
357  2
540  0

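I have already achieved what I want with a for loop, but it becomes very slow on huge pandas DataFrames. I found some similar solutions, but none of them solved my problem.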
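
The asker's loop is not shown, so the following is only a minimal sketch of what such a loop-based baseline might look like. The helper name `max_zero_run` and the small sample frame are hypothetical, and the rows are assumed to be already sorted by id and date. It is useful mainly as a reference result for checking faster approaches.

```python
from itertools import groupby

import pandas as pd


def max_zero_run(values):
    """Return the length of the longest run of consecutive zeros in `values`."""
    best = 0
    # itertools.groupby yields one (key, run) pair per run of equal adjacent values.
    for key, run in groupby(values):
        if key == 0:
            best = max(best, sum(1 for _ in run))
    return best


# Hypothetical sample with the same columns as the question (date omitted).
df = pd.DataFrame({
    "id":    [354, 354, 354, 354, 357, 357, 357, 540, 540],
    "value": [0,   0,   5,   0,   5,   0,   0,   7,   8],
})

# One Python-level pass per id; this is the part that gets slow on large frames.
result = {}
for id_, grp in df.groupby("id"):
    result[int(id_)] = max_zero_run(grp["value"].tolist())

print(result)  # {354: 2, 357: 2, 540: 0}
```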

Answer 1

And*_* L. 1

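Create a group ID `m` for consecutive rows that have the same value. Next, `groupby` on `id` and `m` and call `value_counts`, then use `.loc` on the resulting MultiIndex to slice only the value `0` on the right-most index level. Finally, keep one entry per `id` by filtering out duplicate index entries with `duplicated`, and `reindex` to create a 0 value for any `id` that has no `0` count.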
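
To make the first step concrete, here is a small illustration of how `diff().ne(0).cumsum()` labels runs of equal values. The toy Series is hypothetical and is not the question's data.

```python
import pandas as pd

# Hypothetical toy column, not the question's data.
s = pd.Series([0, 0, 5, 5, 0, 7, 7, 7], name="value")

# True wherever the value differs from the previous row.
# The first row compares against NaN, so it is always True.
changed = s.diff().ne(0)

# Running count of changes: every run of equal values gets its own label.
gid = changed.cumsum()

print(pd.DataFrame({"value": s, "changed": changed, "gid": gid}))
print(gid.tolist())  # [1, 1, 2, 2, 3, 4, 4, 4]
```

Note that `diff` only works on numeric data. For other dtypes, `s.ne(s.shift())` gives the same kind of change mask.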

m = df.value.diff().ne(0).cumsum().rename('gid')
#Consecutive rows that have the same value are assigned the same ID number by this command.
#This is how a group of consecutive rows with the same value is identified, hence the name group ID.

df1 = df.groupby(['id', m]).value.value_counts().loc[:,:,0].droplevel(-1)
#This groupby puts consecutive rows of the same value, per `id`, into separate groups.
#Within each group, count the occurrences of each value, then use `.loc` to pick only `0`, because only the count of value `0` matters here.

df1 = df1.sort_values(ascending=False)
df1[~df1.index.duplicated()].reindex(df.id.unique(), fill_value=0)
#There can be several groups of value `0` per `id`. We want only the group with the highest count.
#`value_counts` sorts counts only within each (`id`, `gid`) group, so sort the counts descending first;
#the True/False mask from `duplicated` then keeps the first (largest) entry per `id`.
#Finally, `reindex` adds any `id` that has no value 0 in the original `df`.
#Note: `id` is the column `id` in `df`. It is different from the group ID `m` created for the groupby.

Out[315]:
id
354    3
357    2
540    0
Name: value, dtype: int64
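
As an alternative sketch (not the answerer's code), the longest zero run per `id` can also be taken with an explicit `max`, so the result does not depend on the order in which the zero runs appear. The sample frame, the column name `run`, and the output name `max_consecutive_zeros` are assumptions for illustration. Rows are assumed to be sorted by id and date, and "consecutive" is taken to mean adjacent rows, not adjacent calendar dates.

```python
import pandas as pd

# Hypothetical sample: for id 354 the longest zero run is NOT the first one.
df = pd.DataFrame({
    "id":    [354, 354, 354, 354, 354, 357, 357, 357, 540, 540],
    "value": [0,   5,   0,   0,   0,   5,   0,   0,   7,   8],
})

is_zero = df["value"].eq(0)

# A new run starts whenever the value changes or the id changes.
new_run = df["value"].ne(df["value"].shift()) | df["id"].ne(df["id"].shift())
run_id = new_run.cumsum()

out = (
    df.assign(run=run_id)
      .loc[is_zero]                               # keep only rows with value 0
      .groupby(["id", "run"])
      .size()                                     # length of every zero run
      .groupby(level="id")
      .max()                                      # longest zero run per id
      .reindex(df["id"].unique(), fill_value=0)   # ids without zeros get 0
      .rename("max_consecutive_zeros")
)

print(out.tolist())  # [3, 2, 0] for ids 354, 357, 540
```

Taking `max` per `id` costs one extra groupby but avoids relying on any sort order of the intermediate counts.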
