Create dummies from a column with multiple values in pandas

mkl*_*kln 30 python dummy-data pandas categorical-data

I am looking for a pythonic way to handle the following problem.

The pandas.get_dummies() method is great to create dummies from a categorical column of a dataframe. For example, if the column has values in ['A', 'B'], get_dummies() creates 2 dummy variables and assigns 0 or 1 accordingly.

Now, I need to handle this situation. A single column, let's call it 'label', has values like ['A', 'B', 'C', 'D', 'A*C', 'C*D']. get_dummies() creates 6 dummies, but I only want 4 of them (one each for A, B, C and D), so that a row could have multiple 1s.
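
To make the situation concrete, here is a minimal sketch of it. The DataFrame name `df` and its contents are assumptions built from the values listed above.

```python
import pandas as pd

# Hypothetical sample data using the values from the description
df = pd.DataFrame({"label": ["A", "B", "C", "D", "A*C", "C*D"]})

# Each distinct string becomes its own column, so 'A*C' and 'C*D'
# get separate dummies instead of setting A, C and D
dummies = pd.get_dummies(df["label"])
print(dummies.columns.tolist())
```

The result has six columns, including one for 'A*C' and one for 'C*D', rather than the four wanted.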

Is there a way to handle this in a pythonic way? I could only think of some step-by-step algorithm to get it, but that would not include get_dummies(). Thanks

Edited, hope it is more clear!

off*_*one 63

I know it has been a while since this question was asked, but there is (at least now) a one-liner that is supported by the documentation:

In [4]: df
Out[4]:
      label
0  (a, c, e)
1     (a, d)
2       (b,)
3     (d, e)

In [5]: df['label'].str.join(sep='*').str.get_dummies(sep='*')
Out[5]:
   a  b  c  d  e
0  1  0  1  0  1
1  1  0  0  1  0
2  0  1  0  0  0
3  0  0  0  1  1

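  • Based on my question, I think `df['label'].str.get_dummies(sep='*')` is enough. I guess the first part is only needed if the labels are not already strings, since the get_dummies function expects strings (3 upvotes)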
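
As a sketch of the point made in this comment: when the column already holds '*'-separated strings, as in the question, the join step can be skipped. The DataFrame below is an assumption built from the values listed in the question.

```python
import pandas as pd

# Hypothetical sample data using the values from the question
df = pd.DataFrame({"label": ["A", "B", "C", "D", "A*C", "C*D"]})

# Split each string on '*' and create one indicator column per distinct piece
dummies = df["label"].str.get_dummies(sep="*")
print(dummies)
```

This gives four columns (A, B, C, D), and the rows for 'A*C' and 'C*D' each contain two 1s.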

ari*_*ell 5

I have a cleaner solution. Suppose we want to transform the following dataframe

   pageid category
0       0        a
1       0        b
2       1        a
3       1        c
into

        a  b  c
pageid         
0       1  1  0
1       1  0  1

One way to do this is to use scikit-learn's DictVectorizer. However, I would be interested in learning about other methods.

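import pandas as pd
from sklearn.feature_extraction import DictVectorizer

df = pd.DataFrame(dict(pageid=[0, 0, 1, 1], category=['a', 'b', 'a', 'c']))

# One tuple of (category, 1) pairs per pageid
grouped = df.groupby('pageid').category.apply(lambda lst: tuple((k, 1) for k in lst))
# Turn each tuple of pairs into a {category: 1} dict
category_dicts = [dict(tuples) for tuples in grouped]
v = DictVectorizer(sparse=False)
X = v.fit_transform(category_dicts)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(X, columns=v.get_feature_names_out(), index=grouped.index)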
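
Since the answer asks about other methods, one possible alternative that stays within pandas is `pd.crosstab`. This is only a sketch and not part of the original answer. The `clip(upper=1)` step is an assumption, added so that a repeated (pageid, category) pair still yields 0/1 indicators rather than a count.

```python
import pandas as pd

df = pd.DataFrame(dict(pageid=[0, 0, 1, 1], category=["a", "b", "a", "c"]))

# Count occurrences of each (pageid, category) pair, then cap the counts
# at 1 so the table holds 0/1 indicators rather than frequencies
indicators = pd.crosstab(df["pageid"], df["category"]).clip(upper=1)
print(indicators)
```

The result is indexed by pageid with one column per category, matching the target table above.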