Python Pandas: create a variable for the unique combinations of two categorical variables?

Question

Suppose I have some data:

import pandas as pd

df = pd.DataFrame({'location': ['store', 'online', 'store', 'online', 'online'],
                   'item': ['apple', 'apple', 'orange', 'orange', 'orange']})
df
>>>

  location    item
0    store   apple
1   online   apple
2    store  orange
3   online  orange
4   online  orange

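You will notice that there are four possible combinations of the two variables: (store, apple), (online, apple), (store, orange), (online, orange). I would like to assign a single dummy-variable column that labels each combination. My naive approach creates four dummy variables, whereas I want one label column: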

# dtype=int keeps the 0/1 display; newer pandas versions default to booleans
pd.get_dummies(df, columns=['location', 'item'], dtype=int)
>>>

   location_online  location_store  item_apple  item_orange
0                0               1           1            0
1                1               0           1            0
2                0               1           0            1
3                1               0           0            1
4                1               0           0            1

What I would prefer is something like this:

df
>>>

  location    item       combination  dummy
0    store   apple    (store, apple)      0
1   online   apple   (online, apple)      1
2    store  orange   (store, orange)      2
3   online  orange  (online, orange)      3
4   online  orange  (online, orange)      3

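Note that the dummy value equals the index in the first four rows only because each of those rows is a distinct combination. This is obviously not true in general.

Edit 1: The above was edited in response to comments. Edit 2: I added a fifth row to show that a row can be repeated; a repeated row should get the same dummy/combination as its duplicate.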

Answer 1

Qua*_*ang 5

Let's try agg:

df['combination'] = df[['location','item']].agg(tuple, axis=1)
df['dummy'] = df['combination'].factorize()[0]

Output:

  location    item       combination  dummy
0    store   apple    (store, apple)      0
1   online   apple   (online, apple)      1
2    store  orange   (store, orange)      2
3   online  orange  (online, orange)      3
4   online  orange  (online, orange)      3
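
As a minimal, self-contained sketch of why this works (an illustration, not part of the original answer): factorize() returns a pair of integer codes and the unique values, and it numbers values in order of first appearance, so repeated tuples share one code. The names codes, uniques, and n_combinations are introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'location': ['store', 'online', 'store', 'online', 'online'],
    'item': ['apple', 'apple', 'orange', 'orange', 'orange'],
})

# One tuple per row, built the same way as in the answer.
combination = df[['location', 'item']].agg(tuple, axis=1)

# factorize() returns (codes, uniques); only the integer codes are needed.
codes, uniques = combination.factorize()
df['dummy'] = codes

# Rows 3 and 4 hold the same pair, so they receive the same code.
print(df['dummy'].tolist())   # [0, 1, 2, 3, 3]

# Number of distinct combinations actually present in the data.
n_combinations = df['dummy'].nunique()
print(n_combinations)         # 4
```

The codes depend on row order: if the rows were shuffled, the same combinations would still share a code, but the specific integers could differ.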

If you don't need the combination column, you can use groupby.ngroup():

df['dummy'] = df.groupby(['location','item'], sort=False).ngroup()

Output:

  location    item  dummy
0    store   apple      0
1   online   apple      1
2    store  orange      2
3   online  orange      3
4   online  orange      3
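
The sort=False argument matters for the numbering. The sketch below (an illustration added here, with the hypothetical variable names first_seen and sorted_keys) compares it with the default sort=True, which numbers groups by sorted key order instead of order of first appearance.

```python
import pandas as pd

df = pd.DataFrame({
    'location': ['store', 'online', 'store', 'online', 'online'],
    'item': ['apple', 'apple', 'orange', 'orange', 'orange'],
})

# sort=False: groups are numbered in the order they first appear.
first_seen = df.groupby(['location', 'item'], sort=False).ngroup()
print(first_seen.tolist())    # [0, 1, 2, 3, 3]

# Default sort=True: groups are numbered by sorted key order,
# i.e. (online, apple)=0, (online, orange)=1, (store, apple)=2, (store, orange)=3.
sorted_keys = df.groupby(['location', 'item']).ngroup()
print(sorted_keys.tolist())   # [2, 0, 3, 1, 1]
```

Either numbering is a valid label column; sort=False simply reproduces the labels shown in the question.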
