Python Pandas: Create a variable for the unique combinations of two categorical variables?

jbu*_*_13 2 python combinations pandas

Suppose I have some data:

import pandas as pd

df = pd.DataFrame({'location': ['store', 'online', 'store', 'online', 'online'],
                   'item': ['apple', 'apple', 'orange', 'orange', 'orange']})
df
>>>

  location    item
0    store   apple
1   online   apple
2    store  orange
3   online  orange
4   online  orange

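You will notice that there are four possible combinations of the variables: (store, apple), (online, apple), (store, orange), (online, orange). I want to assign a single dummy-variable column that labels each combination. My naive approach creates four dummy variables, whereas I want one label column: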

pd.get_dummies(df,['location','item'])
>>>

   location_online  location_store  item_apple  item_orange
0                0               1           1            0
1                1               0           1            0
2                0               1           0            1
3                1               0           0            1
4                1               0           0            1

Whereas I would prefer it to look like this:

df 
>>>
  location    item       combination  dummy
0    store   apple    (store, apple)      0
1   online   apple   (online, apple)      1
2    store  orange   (store, orange)      2
3   online  orange  (online, orange)      3
4   online  orange  (online, orange)      3

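Note that the dummy value coincides with the index only because the first 4 rows are all distinct. This is obviously not true in general.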

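Edit 1: The above was edited in response to comments. Edit 2: I added a fifth row to show that a row can be repeated; a repeated row should get the same dummy/combination as its duplicate.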

Qua*_*ang 5

Let's try agg:

df['combination'] = df[['location','item']].agg(tuple, axis=1)
df['dummy'] = df['combination'].factorize()[0]

Output:

  location    item       combination  dummy
0    store   apple    (store, apple)      0
1   online   apple   (online, apple)      1
2    store  orange   (store, orange)      2
3   online  orange  (online, orange)      3
4   online  orange  (online, orange)      3
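
To make the factorize step a bit more concrete, here is a minimal sketch of what it returns. As an assumption for illustration only, it builds the tuples with zip instead of agg (this is not the method used above), and the names combination, codes and uniques are chosen here rather than taken from the answer.

```python
import pandas as pd

df = pd.DataFrame({'location': ['store', 'online', 'store', 'online', 'online'],
                   'item': ['apple', 'apple', 'orange', 'orange', 'orange']})

# Build one tuple per row (illustrative alternative to .agg(tuple, axis=1)).
combination = pd.Series(list(zip(df['location'], df['item'])), index=df.index)

# factorize() returns the integer codes together with the distinct values;
# codes are handed out in order of first appearance.
codes, uniques = combination.factorize()

print(codes.tolist())  # [0, 1, 2, 3, 3]
print(len(uniques))    # 4
```

The repeated (online, orange) row receives the same code as its duplicate, which is the behaviour asked for in the question.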

If you don't care about the combination column, you can use groupby.ngroup():

df['dummy'] = df.groupby(['location','item'], sort=False).ngroup()

Output:

  location    item  dummy
0    store   apple      0
1   online   apple      1
2    store  orange      2
3   online  orange      3
4   online  orange      3
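
The sort=False argument is what makes the labels follow the order in which combinations first appear. A small sketch comparing it with the default (sort=True), assuming the same example data; the variable names keys, first_seen and sorted_keys are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'location': ['store', 'online', 'store', 'online', 'online'],
                   'item': ['apple', 'apple', 'orange', 'orange', 'orange']})

keys = ['location', 'item']

# sort=False: groups are numbered in order of first appearance.
first_seen = df.groupby(keys, sort=False).ngroup()

# Default sort=True: groups are numbered in sorted key order instead.
sorted_keys = df.groupby(keys).ngroup()

print(first_seen.tolist())   # [0, 1, 2, 3, 3]
print(sorted_keys.tolist())  # [2, 0, 3, 1, 1]
```

Either numbering gives one label per unique combination; they differ only in which integer each combination gets.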