Finding sequential patterns with a condition

Question

No_*_*ody 8 python numpy data-manipulation dataframe pandas

I have a df:

  Id  Event SeqNo
   1    A    1
   1    B    2
   1    C    3
   1    ABD  4
   1    A    5
   1    C    6
   1    A    7
   1    CDE  8
   1    D    9
   1    B    10 
   1    ABD  11
   1    D    12
   1    B    13
   1    CDE  14
   1    A    15

I am looking for the pattern "ABD followed by CDE with no event B in between". For example, the output for this df would be:

 Id  Event SeqNo
 1    ABD  4
 1    A    5
 1    C    6
 1    A    7
 1    CDE  8

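This pattern can occur multiple times for a single Id, and I would like to find the list of all such Ids along with their respective counts (if possible).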

Answer 1

Div*_*kar 2

Here's a vectorized approach that uses a scaling trick and convolution to find the required pattern:

import numpy as np

# Get the col in context and scale it to the three strings to form an ID array
# (1 for 'ABD', 2 for 'B', 3 for 'CDE', 0 for everything else)
a = df['Event']
id_ar = (a=='ABD') + 2*(a=='B') + 3*(a=='CDE')

# Mask of those specific strings and hence extract the corresponding masked df
# (.copy() so that the new column below is not written into a view of df)
mask = id_ar>0
df1 = df[mask].copy()

# Get pattern col with 1s at places with the pattern found, 0s elsewhere
df1['Pattern'] = (np.convolve(id_ar[mask],[9,1],'same')==28).astype(int)

# Groupby Id col and sum the pattern col for final output
out = df1.groupby(['Id'])['Pattern'].sum()

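The convolution part might be a bit tricky. The idea is to use id_ar, which holds the values 1, 2 and 3 for the strings 'ABD', 'B' and 'CDE' respectively. We are looking for a 1 followed by a 3, so convolving with the kernel [9,1] gives 1*1 + 3*9 = 28 as the convolution sum for a window that has 'ABD' and then 'CDE'. Hence, we look for a convolution sum of 28 to find a match. For the case of 'ABD' followed by 'B' and then 'CDE', the filtered codes are 1, 2, 3, so the convolution sums are different (19 and 29) and the case is filtered out.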
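To make the arithmetic concrete, here is a small sketch that enumerates every (previous, current) pair of codes and prints the convolution sum each one produces. The `codes`, `pair_sums` and `matches` names are illustrative only and not part of the answer above.

```python
import numpy as np
from itertools import product

# Codes used in the answer: 1 = 'ABD', 2 = 'B', 3 = 'CDE'
codes = {1: 'ABD', 2: 'B', 3: 'CDE'}

# With kernel [9, 1] and mode 'same', output[n] = 9*x[n] + 1*x[n-1],
# so each output value encodes the pair (previous code, current code).
pair_sums = {}
for prev, cur in product(codes, repeat=2):
    conv = np.convolve([prev, cur], [9, 1], 'same')
    pair_sums[(codes[prev], codes[cur])] = int(conv[1])

# Print the pairs ordered by their convolution sum
for pair, total in sorted(pair_sums.items(), key=lambda kv: kv[1]):
    print(pair, total)

# Pairs whose sum equals the target value 28
matches = [pair for pair, total in pair_sums.items() if total == 28]
print(matches)
```

All nine pairs give distinct sums, and only 'ABD' directly followed by 'CDE' (after the other events are filtered out) reaches 28, which is why a single equality test is enough.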

Sample run:

1) Input dataframe:

In [377]: df
Out[377]: 
   Id Event SeqNo
0   1     A     1
1   1     B     2
2   1     C     3
3   1   ABD     4
4   1     B     5
5   1     C     6
6   1     A     7
7   1   CDE     8
8   1     D     9
9   1     B    10
10  1   ABD    11
11  1     D    12
12  1     B    13
13  2     A     1
14  2     B     2
15  2     C     3
16  2   ABD     4
17  2     A     5
18  2     C     6
19  2     A     7
20  2   CDE     8
21  2     D     9
22  2     B    10
23  2   ABD    11
24  2     D    12
25  2     B    13
26  2   CDE    14
27  2     A    15
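
To reproduce this run, the input frame printed above could be rebuilt roughly as follows. This is only a sketch: the `events` dict and `rows` list are illustrative names, and the values are copied from the printout.

```python
import pandas as pd

# Event sequences per Id, copied from the dataframe printed above
events = {
    1: ['A', 'B', 'C', 'ABD', 'B', 'C', 'A', 'CDE', 'D', 'B', 'ABD', 'D', 'B'],
    2: ['A', 'B', 'C', 'ABD', 'A', 'C', 'A', 'CDE', 'D', 'B', 'ABD', 'D', 'B',
        'CDE', 'A'],
}

# SeqNo restarts at 1 for every Id, as in the printout
rows = [(id_, ev, seq)
        for id_, evs in events.items()
        for seq, ev in enumerate(evs, start=1)]
df = pd.DataFrame(rows, columns=['Id', 'Event', 'SeqNo'])
print(df)
```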

2) Intermediate filtered output (see the Pattern column for the presence of the required pattern):

In [380]: df1
Out[380]: 
   Id Event SeqNo  Pattern
1   1     B     2        0
3   1   ABD     4        0
4   1     B     5        0
7   1   CDE     8        0
9   1     B    10        0
10  1   ABD    11        0
12  1     B    13        0
14  2     B     2        0
16  2   ABD     4        0
20  2   CDE     8        1
22  2     B    10        0
23  2   ABD    11        0
25  2     B    13        0
26  2   CDE    14        0

3) Final output:

In [381]: out
Out[381]: 
Id
1    0
2    1
Name: Pattern, dtype: int64
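
One caveat that may matter on other data: the convolution runs over the whole filtered column, so an 'ABD' that closes one Id followed by a 'CDE' that opens the next Id would also be counted. The sketch below is a per-Id variant that swaps the convolution for a grouped `shift`. It is not the method used in this answer, `count_pattern` is a hypothetical helper name, and it assumes the rows are already sorted by SeqNo within each Id.

```python
import pandas as pd

def count_pattern(df):
    """Count 'ABD' -> 'CDE' with no 'B' in between, separately per Id."""
    # Keep only the three events that matter for the pattern
    key = df[df['Event'].isin(['ABD', 'B', 'CDE'])]
    # Previous key event within the same Id (NaN at the start of each Id)
    prev = key.groupby('Id')['Event'].shift()
    hit = (key['Event'] == 'CDE') & (prev == 'ABD')
    # Sum hits per Id; reindex so Ids without any key event still report 0
    ids = pd.Index(df['Id'].unique(), name='Id')
    return hit.groupby(key['Id']).sum().reindex(ids, fill_value=0)

# Tiny made-up example where an 'ABD' ends Id 1 and a 'CDE' starts Id 2
demo = pd.DataFrame({
    'Id':    [1, 1, 1, 2, 2, 2],
    'Event': ['ABD', 'A', 'ABD', 'CDE', 'ABD', 'CDE'],
    'SeqNo': [1, 2, 3, 1, 2, 3],
})
print(count_pattern(demo))
```

On this made-up frame the grouped version reports no match for Id 1 and one match for Id 2, whereas a convolution over the whole column would also count the pair that straddles the two Ids.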
