How to identify approximately (threshold-defined) consecutive non-null data using pandas?

Del*_*rge 2 python numpy time-series scipy pandas

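I'd like to extract rainfall events from a rainfall time series, while allowing X dry hours (as a parameter) within the same event. So by rainfall event I mean approximately consecutive rainfall (RF > 0) with at most X consecutive dry hours (RF = 0) inside it.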

I would rather not do this with iterators and increments; I am looking for pandas or numpy/scipy tools I can rely on.

Here is a sample of my dataframe. RF is the raw rainfall, RFfill is RF with nodata filled by RF.interpolate(), and evtId is a field created to store a unique event ID.

                    TS   RF  RFfill  evtId
0  1997-11-27 14:00:00  0.3     0.3    NaN
1  1997-11-27 15:00:00  1.1     1.1    NaN
2  1997-11-27 16:00:00  0.2     0.2    NaN
3  1997-11-27 17:00:00  0.0     0.0    NaN
4  1997-11-27 18:00:00  0.0     0.0    NaN
5  1997-11-27 19:00:00  1.1     1.1    NaN
6  1997-11-27 20:00:00  0.6     0.6    NaN
7  1997-11-27 21:00:00  0.0     0.0    NaN
8  1997-11-27 22:00:00  0.0     0.0    NaN
9  1997-11-27 23:00:00  0.0     0.0    NaN
10 1997-11-28 00:00:00  0.0     0.0    NaN
11 1997-11-28 01:00:00  0.0     0.0    NaN
12 1997-11-28 02:00:00  0.0     0.0    NaN
13 1997-11-28 03:00:00  0.0     0.0    NaN
14 1997-11-28 04:00:00  0.0     0.0    NaN
15 1997-11-28 05:00:00  0.0     0.0    NaN
16 1997-11-28 06:00:00  0.0     0.0    NaN
17 1997-11-28 07:00:00  0.0     0.0    NaN
18 1997-11-28 08:00:00  0.0     0.0    NaN
19 1997-11-28 09:00:00  0.8     0.8    NaN
20 1997-11-28 10:00:00  1.1     1.1    NaN
21 1997-11-28 11:00:00  2.3     2.3    NaN
22 1997-11-28 12:00:00  1.4     1.4    NaN
23 1997-11-28 13:00:00  0.4     0.4    NaN
24 1997-11-28 14:00:00  0.2     0.2    NaN
25 1997-11-28 15:00:00  0.0     0.0    NaN
26 1997-11-28 16:00:00  0.0     0.0    NaN
27 1997-11-28 17:00:00  0.0     0.0    NaN
28 1997-11-28 18:00:00  0.0     0.0    NaN
29 1997-11-28 19:00:00  0.0     0.0    NaN
30 1997-11-28 20:00:00  0.0     0.0    NaN

Here is the expected output, with 5 dry hours allowed:

                    TS   RF  RFfill  evtId
0  1997-11-27 14:00:00  0.3     0.3    0
1  1997-11-27 15:00:00  1.1     1.1    0
2  1997-11-27 16:00:00  0.2     0.2    0
3  1997-11-27 17:00:00  0.0     0.0    0
4  1997-11-27 18:00:00  0.0     0.0    0
5  1997-11-27 19:00:00  1.1     1.1    0
6  1997-11-27 20:00:00  0.6     0.6    0
7  1997-11-27 21:00:00  0.0     0.0    NaN
8  1997-11-27 22:00:00  0.0     0.0    NaN
9  1997-11-27 23:00:00  0.0     0.0    NaN
10 1997-11-28 00:00:00  0.0     0.0    NaN
11 1997-11-28 01:00:00  0.0     0.0    NaN
12 1997-11-28 02:00:00  0.0     0.0    NaN
13 1997-11-28 03:00:00  0.0     0.0    NaN
14 1997-11-28 04:00:00  0.0     0.0    NaN
15 1997-11-28 05:00:00  0.0     0.0    NaN
16 1997-11-28 06:00:00  0.0     0.0    NaN
17 1997-11-28 07:00:00  0.0     0.0    NaN
18 1997-11-28 08:00:00  0.0     0.0    NaN
19 1997-11-28 09:00:00  0.8     0.8    1
20 1997-11-28 10:00:00  1.1     1.1    1
21 1997-11-28 11:00:00  2.3     2.3    1
22 1997-11-28 12:00:00  1.4     1.4    1
23 1997-11-28 13:00:00  0.4     0.4    1
24 1997-11-28 14:00:00  0.2     0.2    1
25 1997-11-28 15:00:00  0.0     0.0    NaN
26 1997-11-28 16:00:00  0.0     0.0    NaN
27 1997-11-28 17:00:00  0.0     0.0    NaN
28 1997-11-28 18:00:00  0.0     0.0    NaN
29 1997-11-28 19:00:00  0.0     0.0    NaN
30 1997-11-28 20:00:00  0.0     0.0    NaN

Any ideas that could help me achieve this?

unu*_*tbu 5

import numpy as np
import pandas as pd
import scipy.ndimage as ndimage

df = pd.DataFrame({'RF': [ 0.3,  1.1,  0.2,  0. ,  0. ,  0. ,  0. ,  0. ,  
                           1.1,  0.6,  0. , 0. ,  0. ,  0. ,  0. ,  0. ,  
                           0.8,  1.1,  2.3,  1.4,  0.4,  0.2, 0. ,  0. ,  
                           0. ,  0. ,  0. ,  0. ]})

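consecutive = 5  # maximum number of dry hours allowed inside one event
mask = df['RF'] > 0  # True where it rained
df['mask'] = mask
# Dilate so that wet runs separated by at most `consecutive` dry hours merge
df['dilation'] = ndimage.binary_dilation(mask, structure=[1]*(consecutive+1))
# Erode to undo the overshoot; border_value=1 keeps the series edges from eroding
df['erosion'] = ndimage.binary_erosion(df['dilation'], 
    structure=[1]*(consecutive+1), border_value=1)
# Number each connected run of True, then shift to 0-based ids (NaN outside events)
df['labeled'], nobjs = ndimage.label(df['erosion'])
df['evtId'] = np.where(df['labeled'] > 0, df['labeled']-1, np.nan)
print(df[['RF', 'evtId']])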

yields

#      RF  evtId
# 0   0.3      0
# 1   1.1      0
# 2   0.2      0
# 3   0.0      0
# 4   0.0      0
# 5   0.0      0
# 6   0.0      0
# 7   0.0      0
# 8   1.1      0
# 9   0.6      0
# 10  0.0    NaN
# 11  0.0    NaN
# 12  0.0    NaN
# 13  0.0    NaN
# 14  0.0    NaN
# 15  0.0    NaN
# 16  0.8      1
# 17  1.1      1
# 18  2.3      1
# 19  1.4      1
# 20  0.4      1
# 21  0.2      1
# 22  0.0    NaN
# 23  0.0    NaN
# 24  0.0    NaN
# 25  0.0    NaN
# 26  0.0    NaN
# 27  0.0    NaN
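
To check the approach against the data in the question itself, the snippet above can be wrapped in a small helper. This is only a sketch: the name label_events and its max_dry parameter are made up here for illustration, and the series below is the RF column from the question re-typed with a plain integer index.

```python
import numpy as np
import pandas as pd
import scipy.ndimage as ndimage

def label_events(rf, max_dry):
    """Hypothetical helper: label wet spells, bridging up to max_dry dry rows."""
    structure = [1] * (max_dry + 1)
    wet = (rf > 0).to_numpy()
    dilated = ndimage.binary_dilation(wet, structure=structure)
    closed = ndimage.binary_erosion(dilated, structure=structure, border_value=1)
    labeled, n_events = ndimage.label(closed)
    # 0-based event ids, NaN for rows outside any event
    return pd.Series(np.where(labeled > 0, labeled - 1, np.nan), index=rf.index)

# RF column from the question: rows 0-30
rf = pd.Series([0.3, 1.1, 0.2, 0.0, 0.0, 1.1, 0.6] + [0.0] * 12
               + [0.8, 1.1, 2.3, 1.4, 0.4, 0.2] + [0.0] * 6)
evt_id = label_events(rf, max_dry=5)
print(pd.DataFrame({'RF': rf, 'evtId': evt_id}))
```

With max_dry=5 this should put rows 0-6 in event 0 and rows 19-24 in event 1, which is the expected output given in the question.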

Explanation: first prepare a binary mask which is True where df['RF'] > 0:

mask = (df['RF'] > 0)
df['mask'] = mask
#      RF   mask
# 0   0.3   True
# 1   1.1   True
# 2   0.2   True
# 3   0.0  False
# 4   0.0  False
# 5   0.0  False
# 6   0.0  False
# 7   0.0  False
# 8   1.1   True
# 9   0.6   True
# ...

Next, dilate the mask to join together islands of Trues (rainy hours) that are separated by 5 or fewer Falses (dry hours):

df['dilation'] = ndimage.binary_dilation(mask, structure=[1]*(consecutive+1))
#      RF   mask dilation
# 0   0.3   True     True
# 1   1.1   True     True
# 2   0.2   True     True
# 3   0.0  False     True   <--, 
# 4   0.0  False     True      |
# 5   0.0  False     True      |  dilation filled over the 5 dry hours
# 6   0.0  False     True      |
# 7   0.0  False     True   <--'
# 8   1.1   True     True
# 9   0.6   True     True
# 10  0.0  False     True   <-- But the `True`s extend a bit too far
# 11  0.0  False     True   <--
# 12  0.0  False    False
# 13  0.0  False     True
# 14  0.0  False     True
# 15  0.0  False     True
# 16  0.8   True     True
# 17  1.1   True     True
# 18  2.3   True     True
# 19  1.4   True     True
# 20  0.4   True     True
# 21  0.2   True     True
# 22  0.0  False     True
# 23  0.0  False     True
# 24  0.0  False    False
# 25  0.0  False    False
# 26  0.0  False    False
# 27  0.0  False    False
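
To see where the overshoot in the table above comes from, it may help to dilate a single True on its own. The array below is invented for illustration; only the structure argument is taken from the answer.

```python
import numpy as np
import scipy.ndimage as ndimage

consecutive = 5
single = np.zeros(12, dtype=bool)
single[5] = True  # one wet hour surrounded by dry hours

dilated = ndimage.binary_dilation(single, structure=[1] * (consecutive + 1))
# Positions that became True after dilation
print(np.flatnonzero(dilated))
```

The single True grows into a run of consecutive+1 Trues: three rows before it and two rows after it, the same spread seen around rows 9 and 16 in the table. Two wet runs therefore merge exactly when the dry gap between them is 5 rows or fewer, and the extra Trues on the outer sides are what the next step has to remove.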

Next, use binary erosion to remove the Trues that extended too far.

df['erosion'] = ndimage.binary_erosion(df['dilation'], structure=[1]*(consecutive+1), 
                                       border_value=1)
#      RF   mask dilation erosion
# 0   0.3   True     True    True
# 1   1.1   True     True    True
# 2   0.2   True     True    True
# 3   0.0  False     True    True
# 4   0.0  False     True    True
# 5   0.0  False     True    True
# 6   0.0  False     True    True
# 7   0.0  False     True    True
# 8   1.1   True     True    True
# 9   0.6   True     True    True
# 10  0.0  False     True   False  <--,
# 11  0.0  False     True   False     |
# 12  0.0  False    False   False     | The Falses have been expanded
# 13  0.0  False     True   False     | (The Trues eroded)
# 14  0.0  False     True   False     |
# 15  0.0  False     True   False  <--'
# 16  0.8   True     True    True
# 17  1.1   True     True    True
# 18  2.3   True     True    True
# 19  1.4   True     True    True
# 20  0.4   True     True    True
# 21  0.2   True     True    True
# 22  0.0  False     True   False
# 23  0.0  False     True   False
# 24  0.0  False    False   False
# 25  0.0  False    False   False
# 26  0.0  False    False   False
# 27  0.0  False    False   False

Now that the Trues represent "rainfall events", we can use ndimage.label to assign a unique number to each rainfall event:

df['labeled'], nobjs = ndimage.label(df['erosion'])
#      RF   mask dilation erosion  labeled
# 0   0.3   True     True    True        1
# 1   1.1   True     True    True        1
# 2   0.2   True     True    True        1
# 3   0.0  False     True    True        1
# 4   0.0  False     True    True        1
# 5   0.0  False     True    True        1
# 6   0.0  False     True    True        1
# 7   0.0  False     True    True        1
# 8   1.1   True     True    True        1
# 9   0.6   True     True    True        1
# 10  0.0  False     True   False        0
# 11  0.0  False     True   False        0
# 12  0.0  False    False   False        0
# 13  0.0  False     True   False        0
# 14  0.0  False     True   False        0
# 15  0.0  False     True   False        0
# 16  0.8   True     True    True        2
# 17  1.1   True     True    True        2
# 18  2.3   True     True    True        2
# 19  1.4   True     True    True        2
# 20  0.4   True     True    True        2
# 21  0.2   True     True    True        2
# 22  0.0  False     True   False        0
# 23  0.0  False     True   False        0
# 24  0.0  False    False   False        0
# 25  0.0  False    False   False        0
# 26  0.0  False    False   False        0
# 27  0.0  False    False   False        0
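
As a minimal illustration of what ndimage.label returns on a 1-D boolean array (the input here is made up, not taken from the rainfall data):

```python
import numpy as np
import scipy.ndimage as ndimage

events = np.array([1, 1, 0, 1, 0, 0, 1, 1], dtype=bool)
labeled, n_events = ndimage.label(events)
print(labeled)   # [1 1 0 2 0 0 3 3]
print(n_events)  # 3
```

Each run of adjacent Trues gets its own positive integer, and 0 is reserved for the background, which is why the final step subtracts 1 to get 0-based event ids.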

And use np.where to decrement the label number by 1 where df['labeled'] > 0, and assign np.nan otherwise:

df['evtId'] = np.where(df['labeled'] > 0, df['labeled']-1, np.nan)
#      RF   mask dilation erosion  labeled  evtId
# 0   0.3   True     True    True        1      0
# 1   1.1   True     True    True        1      0
# 2   0.2   True     True    True        1      0
# 3   0.0  False     True    True        1      0
# 4   0.0  False     True    True        1      0
# 5   0.0  False     True    True        1      0
# 6   0.0  False     True    True        1      0
# 7   0.0  False     True    True        1      0
# 8   1.1   True     True    True        1      0
# 9   0.6   True     True    True        1      0
# 10  0.0  False     True   False        0    NaN
# 11  0.0  False     True   False        0    NaN
# 12  0.0  False    False   False        0    NaN
# 13  0.0  False     True   False        0    NaN
# 14  0.0  False     True   False        0    NaN
# 15  0.0  False     True   False        0    NaN
# 16  0.8   True     True    True        2      1
# 17  1.1   True     True    True        2      1
# 18  2.3   True     True    True        2      1
# 19  1.4   True     True    True        2      1
# 20  0.4   True     True    True        2      1
# 21  0.2   True     True    True        2      1
# 22  0.0  False     True   False        0    NaN
# 23  0.0  False     True   False        0    NaN
# 24  0.0  False    False   False        0    NaN
# 25  0.0  False    False   False        0    NaN
# 26  0.0  False    False   False        0    NaN
# 27  0.0  False    False   False        0    NaN
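
Since the question asked for pandas tools, the same labeling can also be sketched without scipy. This is a different technique from the morphological one used in this answer, not a rewrite of it: it starts a new event whenever the distance between two consecutive wet rows exceeds the allowed gap. The function name label_events_pandas is hypothetical, and the sketch assumes the series has a unique index.

```python
import numpy as np
import pandas as pd

def label_events_pandas(rf, max_dry):
    """Hypothetical pandas-only helper; assumes rf has a unique index."""
    wet = rf > 0
    # Row positions of the wet rows only
    wet_pos = pd.Series(np.arange(len(rf)), index=rf.index)[wet]
    # A wet row opens a new event when more than max_dry dry rows precede it
    new_event = wet_pos.diff() > max_dry + 1
    evt_wet = new_event.cumsum()
    # Put the ids back on the full index; dry rows are NaN for now
    evt = evt_wet.reindex(rf.index)
    # A dry row joins an event only if the nearest wet rows on both sides share an id
    before, after = evt.ffill(), evt.bfill()
    return before.where(before == after)

rf = pd.Series([0.3, 1.1, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 1.1, 0.6,
                0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.8, 1.1, 2.3, 1.4,
                0.4, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
evt_id = label_events_pandas(rf, max_dry=5)
print(pd.DataFrame({'RF': rf, 'evtId': evt_id}))
```

On the sample data of this answer it should give the same evtId column as above. Dry rows before the first wet row or after the last one are always left as NaN in this sketch, whereas the scipy version with border_value=1 may treat short dry runs at the edges of the series differently.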

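Note that dilation followed by erosion is called closing. The reason I used ndimage.binary_dilation followed by ndimage.binary_erosion instead of just calling ndimage.binary_closing is that I needed to set border_value=1 to prevent the edges from being eroded. Compare df['erosion'] with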

ndimage.binary_closing(mask, structure=[1]*(consecutive+1))

and you'll see the difference.
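
The comparison can be run directly on the sample data of this answer. This sketch uses a plain NumPy array instead of the dataframe, and calls binary_closing with its default border handling.

```python
import numpy as np
import scipy.ndimage as ndimage

rf = np.array([0.3, 1.1, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 1.1, 0.6,
               0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.8, 1.1, 2.3, 1.4,
               0.4, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
consecutive = 5
structure = [1] * (consecutive + 1)
mask = rf > 0

dilation = ndimage.binary_dilation(mask, structure=structure)
erosion = ndimage.binary_erosion(dilation, structure=structure, border_value=1)
closing = ndimage.binary_closing(mask, structure=structure)

# Rows where the two results disagree
print(np.flatnonzero(erosion != closing))
```

The two results should differ only in rows 0, 1 and 2: with the default border handling those rainy rows at the start of the series are eroded away, while border_value=1 keeps them in event 0.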