Correctly counting event pairs per attribute value in Pandas using itertools, melt and groupby

ksp*_*spr 5 python pandas

I have a table in the following format:

  Id   |   Sequence   |   Attribute A  |  Attribute B |
  ID1       [A,B,C,D]         A1              B1        
  ID2       [A,B,F,G]         A2              B3            
  ID3       [A,B,C,D]         A1              B1        
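
To make this reproducible, the sample table can be built as follows. This is a minimal sketch, assuming that each Sequence cell holds a Python list of event labels in the order the events occurred; the variable name df matches the code further down.

```python
import pandas as pd

# Sample data from the table above; each Sequence is an ordered list of events.
df = pd.DataFrame({
    "Id": ["ID1", "ID2", "ID3"],
    "Sequence": [["A", "B", "C", "D"], ["A", "B", "F", "G"], ["A", "B", "C", "D"]],
    "Attribute A": ["A1", "A2", "A1"],
    "Attribute B": ["B1", "B3", "B1"],
})
print(df)
```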

I want to count the number of unique IDs for each event pair and attribute value.

The final table should look like this:

  Pair    |  Attribute Type | Attribute Value   | ID Count
  (A,B)        Attribute A          A1              2        # Event A happens before event B in 2 unique IDs where A1 is the value of Attribute A.
  (A,C)        Attribute A          A1              2
  (A,D)        Attribute A          A1              2
  (B,C)        Attribute A          A1              2
  (B,D)        Attribute A          A1              2
  (C,D)        Attribute A          A1              2
  (A,B)        Attribute A          A2              1
  (A,F)        Attribute A          A2              1 
  (A,G)        Attribute A          A2              1 
  (B,F)        Attribute A          A2              1
  (B,G)        Attribute A          A2              1
  (F,G)        Attribute A          A2              1
  (A,B)        Attribute B          B1              2
  (A,C)        Attribute B          B1              2
  (A,D)        Attribute B          B1              2
  (B,C)        Attribute B          B1              2
  (B,D)        Attribute B          B1              2
  (C,D)        Attribute B          B1              2
  (A,B)        Attribute B          B3              1
  (A,F)        Attribute B          B3              1 
  (A,G)        Attribute B          B3              1 
  (B,F)        Attribute B          B3              1
  (B,G)        Attribute B          B3              1
  (F,G)        Attribute B          B3              1

What is the right way to do this? In practice, I will have more than just 2 attributes.

Here is how far I have got:

import itertools

# build every ordered pair of events for each row
df['Sequence Combs'] = df['Sequence'].apply(lambda x: list(itertools.combinations(x, 2)))

  Id   |   Sequence   |          Sequence Combs                |   Attribute A  |  Attribute B |
  ID1       [A,B,C,D]   [(A,B),(A,C),(A,D),(B,C),(B,D),(C,D)]           A1              B1        
  ID2       [A,B,F,G]   [(A,B),(A,F),(A,G),(B,F),(B,G),(F,G)]           A2              B3              
  ID3       [A,B,C,D]   [(A,B),(A,C),(A,D),(B,C),(B,D),(C,D)]           A1              B1      
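
As a side note on why the pairs above can be read as "first event happens before second event": itertools.combinations emits pairs in the order the elements appear in its input. A small sketch (the variable names are only for illustration):

```python
from itertools import combinations

events = ["A", "B", "C", "D"]
pairs = list(combinations(events, 2))
print(pairs)
# [('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D'), ('C', 'D')]

# every pair keeps the order in which the events occur in the sequence
in_order = all(events.index(a) < events.index(b) for a, b in pairs)
print(in_order)
```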

And after exploding:

df = df.explode('Sequence Combs')

I get the following:

  Id   |   Sequence   | Sequence Combs |  Attribute A  |  Attribute B |
  ID1       [A,B,C,D]       (A,B)             A1              B1
  ID1       [A,B,C,D]       (A,C)             A1              B1
  ID1       [A,B,C,D]       (A,D)             A1              B1
  ID1       [A,B,C,D]       (B,C)             A1              B1
  ID1       [A,B,C,D]       (B,D)             A1              B1
  ID1       [A,B,C,D]       (C,D)             A1              B1
  ...          ...           ..               ..              ..

But I'm not sure how to proceed from here. Any ideas?

Dan*_*ejo 2

You can do it like this:

from itertools import combinations

# helper that returns the list of 2-combinations of a sequence
combs = lambda x: list(combinations(x, r=2))

# create a new DataFrame in which the Sequence column holds the list of 2-combinations
res = df.assign(Sequence=df['Sequence'].apply(combs))

# explode, then melt
res = res.explode('Sequence').melt(id_vars=['Id', 'Sequence'], var_name='Attribute Type', value_name='Attribute Value')

# finally group by all the columns but Id, and count
res = res.groupby(['Sequence', 'Attribute Type', 'Attribute Value'])['Id'].count()

print(res)

Output

Sequence  Attribute Type  Attribute Value
(A, B)    Attribute A     A1                 2
                          A2                 1
          Attribute B     B1                 2
                          B3                 1
(A, C)    Attribute A     A1                 2
          Attribute B     B1                 2
(A, D)    Attribute A     A1                 2
          Attribute B     B1                 2
(A, F)    Attribute A     A2                 1
          Attribute B     B3                 1
(A, G)    Attribute A     A2                 1
          Attribute B     B3                 1
(B, C)    Attribute A     A1                 2
          Attribute B     B1                 2
(B, D)    Attribute A     A1                 2
          Attribute B     B1                 2
(B, F)    Attribute A     A2                 1
          Attribute B     B3                 1
(B, G)    Attribute A     A2                 1
          Attribute B     B3                 1
(C, D)    Attribute A     A1                 2
          Attribute B     B1                 2
(F, G)    Attribute A     A2                 1
          Attribute B     B3                 1
Name: Id, dtype: int64
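
One caveat: count() counts rows, not distinct IDs. The two agree on the sample data, but if an event can repeat within a sequence, the same pair shows up more than once for a single Id. Since the question asks for unique IDs, nunique() may be the safer aggregation. A rough sketch on hypothetical data (the repeated event in ID1 is made up for illustration, not taken from the question):

```python
from itertools import combinations
import pandas as pd

# Hypothetical data: ID1 repeats event A, so the pair (A, B) occurs twice for it.
df = pd.DataFrame({
    "Id": ["ID1", "ID2"],
    "Sequence": [["A", "A", "B"], ["A", "B"]],
    "Attribute A": ["A1", "A1"],
})

long = (
    df.assign(Sequence=df["Sequence"].apply(lambda s: list(combinations(s, 2))))
      .explode("Sequence")
      .melt(id_vars=["Id", "Sequence"],
            var_name="Attribute Type", value_name="Attribute Value")
)

keys = ["Sequence", "Attribute Type", "Attribute Value"]
row_counts = long.groupby(keys)["Id"].count()    # counts rows: 3 for (A, B)
unique_ids = long.groupby(keys)["Id"].nunique()  # counts distinct IDs: 2 for (A, B)

print(row_counts)
print(unique_ids)
```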

If you really want to match the expected output, replace the last step with the following:

# finally group by all the columns but Id, count, and rename the count column
res = res.groupby(['Sequence', 'Attribute Type', 'Attribute Value'], as_index=False)['Id'].count().rename(columns={'Id' : 'Id Count'}).sort_values('Attribute Type')

print(res)

Output

   Sequence Attribute Type Attribute Value  Id Count
0    (A, B)    Attribute A              A1         2
1    (A, B)    Attribute A              A2         1
20   (C, D)    Attribute A              A1         2
4    (A, C)    Attribute A              A1         2
6    (A, D)    Attribute A              A1         2
18   (B, G)    Attribute A              A2         1
8    (A, F)    Attribute A              A2         1
10   (A, G)    Attribute A              A2         1
22   (F, G)    Attribute A              A2         1
12   (B, C)    Attribute A              A1         2
16   (B, F)    Attribute A              A2         1
14   (B, D)    Attribute A              A1         2
21   (C, D)    Attribute B              B1         2
19   (B, G)    Attribute B              B3         1
17   (B, F)    Attribute B              B3         1
11   (A, G)    Attribute B              B3         1
13   (B, C)    Attribute B              B1         2
9    (A, F)    Attribute B              B3         1
7    (A, D)    Attribute B              B1         2
5    (A, C)    Attribute B              B1         2
3    (A, B)    Attribute B              B3         1
2    (A, B)    Attribute B              B1         2
15   (B, D)    Attribute B              B1         2
23   (F, G)    Attribute B              B3         1
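
If the row order of the expected table matters, or there are many attribute columns, the whole pipeline could be wrapped up roughly as below. This is only a sketch: the function name count_pairs and the column names Pair and ID Count are my own choices, and it assumes every column other than the id and sequence columns is an attribute. Grouping by Attribute Type, then Attribute Value, then Pair relies on groupby sorting its keys by default, which gives the same ordering as the table in the question.

```python
from itertools import combinations
import pandas as pd


def count_pairs(df, id_col="Id", seq_col="Sequence"):
    """Count distinct IDs per (event pair, attribute type, attribute value)."""
    # every column other than the id and sequence columns is treated as an attribute
    attr_cols = [c for c in df.columns if c not in (id_col, seq_col)]
    long = (
        df.assign(Pair=df[seq_col].apply(lambda s: list(combinations(s, 2))))
          .drop(columns=seq_col)
          .explode("Pair")
          .melt(id_vars=[id_col, "Pair"], value_vars=attr_cols,
                var_name="Attribute Type", value_name="Attribute Value")
    )
    counts = (
        long.groupby(["Attribute Type", "Attribute Value", "Pair"])[id_col]
            .nunique()                      # distinct IDs, not rows
            .reset_index(name="ID Count")
    )
    # reorder the columns to match the layout asked for in the question
    return counts[["Pair", "Attribute Type", "Attribute Value", "ID Count"]]


df = pd.DataFrame({
    "Id": ["ID1", "ID2", "ID3"],
    "Sequence": [["A", "B", "C", "D"], ["A", "B", "F", "G"], ["A", "B", "C", "D"]],
    "Attribute A": ["A1", "A2", "A1"],
    "Attribute B": ["B1", "B3", "B1"],
})

out = count_pairs(df)
print(out)
```

Adding more attribute columns to df needs no change to the function, since melt picks them up through attr_cols.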