I am trying to get the maximum count of consecutive 0 values per id from a Pandas DataFrame with the columns id, date and value, shown below:
id date value
354 2019-03-01 0
354 2019-03-02 0
354 2019-03-03 0
354 2019-03-04 5
354 2019-03-05 5
354 2019-03-09 7
354 2019-03-10 0
357 2019-03-01 5
357 2019-03-02 5
357 2019-03-03 8
357 2019-03-04 0
357 2019-03-05 0
357 2019-03-06 7
357 2019-03-07 7
540 2019-03-02 7
540 2019-03-03 8
540 2019-03-04 9
540 2019-03-05 8
540 2019-03-06 7
540 2019-03-07 5
540 2019-03-08 2
540 2019-03-09 3
540 2019-03-10 2
The desired result, grouped by id, would look like this:
id max_consecutive_zeros
354 3
357 2
540 0
I have achieved what I want with a for loop, but it becomes very slow on huge Pandas DataFrames. I found some similar solutions, but none of them solved my problem.
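For reference, a loop-based baseline of the kind described above might look like the sketch below. It is not the asker's actual code: the helper name `max_zero_run` and the shortened sample frame are assumptions made for illustration, and the rows are assumed to be already sorted by id and date.

```python
import pandas as pd


def max_zero_run(values):
    """Return the length of the longest run of consecutive zeros (hypothetical helper)."""
    best = current = 0
    for v in values:
        # Extend the current run on a zero, reset it on anything else.
        current = current + 1 if v == 0 else 0
        best = max(best, current)
    return best


# Shortened sample data in the same shape as the question's frame.
df = pd.DataFrame({
    "id": [354, 354, 354, 354, 357, 357, 357, 540, 540],
    "value": [0, 0, 0, 5, 5, 0, 0, 7, 8],
})

# One Python-level loop per id: easy to read, but slow on large frames.
baseline = (
    df.groupby("id")["value"]
    .apply(max_zero_run)
    .rename("max_consecutive_zeros")
)
```

This kind of baseline is mainly useful for checking a vectorized solution against on a small sample.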
Create a group ID `m` for consecutive rows that share the same value. Next, call `groupby` on `id` and `m` followed by `value_counts`, then use `.loc` on the resulting MultiIndex to keep only the value `0` on the right-most index level. Finally, take the largest count per `id` and `reindex` so that any `id` without zeros gets a count of 0.
m = df.value.diff().ne(0).cumsum().rename('gid')
# Consecutive rows with the same value get the same number from this command.
# That number identifies a run of identical consecutive values, so I call it the group ID.
df1 = df.groupby(['id', m]).value.value_counts().loc[:,:,0].droplevel(-1)
# This groupby splits each `id` into runs of consecutive identical values.
# Within each run, count each value and use `.loc` to keep only `0`, since only the zero counts matter.
df1.groupby(level=0).max().reindex(df.id.unique(), fill_value=0)
# An `id` can have several runs of `0`; we want only the longest one.
# The runs are listed in order of appearance, not by length,
# so take the maximum count per `id`.
# Finally, `reindex` adds any `id` that has no 0 in the original `df`.
# Note: `id` is the column `id` in `df`. It is different from the group ID `m` created for the groupby.
Out[315]:
id
354 3
357 2
540 0
Name: value, dtype: int64
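
As a cross-check, the same idea can be sketched without `value_counts` by summing a 0/1 zero flag per run. This is an alternative sketch, not the code from the answer above. It assumes the rows are already sorted by id and date, and that gaps in the dates (such as 2019-03-05 to 2019-03-09) do not break a run. The names `zero`, `run_id`, `run_len` and `result` are made up for illustration.

```python
import pandas as pd

# The sample data from the question (dates omitted; rows assumed sorted).
df = pd.DataFrame({
    "id": [354] * 7 + [357] * 7 + [540] * 9,
    "value": [0, 0, 0, 5, 5, 7, 0,
              5, 5, 8, 0, 0, 7, 7,
              7, 8, 9, 8, 7, 5, 2, 3, 2],
})

# 1 where value is zero, 0 otherwise.
zero = df["value"].eq(0).astype(int)

# A new run starts whenever the zero flag flips or the id changes.
new_run = zero.diff().ne(0) | df["id"].ne(df["id"].shift())
run_id = new_run.cumsum().rename("run")

# Summing the flag per run gives the run length for zero runs and 0 otherwise.
run_len = zero.groupby([df["id"], run_id]).sum()

# Longest zero run per id; ids without zeros naturally end up with 0.
result = run_len.groupby(level="id").max().rename("max_consecutive_zeros")
print(result)
```

Because non-zero runs contribute a length of 0, every id appears in the result and no `reindex` step is needed in this variant.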