Vectorizing/contrasting a DataFrame with categorical variables

Question

Ame*_*ina 4 python pandas scikit-learn statsmodels

Suppose I have a DataFrame like this:

      A      B
0   bar    one
1   bar  three
2  flux    six
3   bar  three
4   foo   five
5  flux    one
6   foo    two

I want to apply a dummy-coding contrast to it, so that each column maps each of its unique values to a different integer.

I tried scikit-learn's DictVectorizer, but I get:

> from sklearn.feature_extraction import DictVectorizer as DV
> vectorizer        = DV( sparse = False )
> dict_to_vectorize = df.T.to_dict().values()
> df_vec            = vectorizer.fit_transform(dict_to_vectorize )
> df_vec
array([[ 1.,  0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  1.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.]])

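This happens because scikit-learn's DictVectorizer is designed to output a one-of-K (one-hot) encoding, which produces one column per distinct value. What I want is a simple integer encoding, with one column per variable.

How can I do this with scikit-learn and/or pandas? Beyond that, are there any other Python packages that help with contrast coding in general?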

Answer 1

unu*_*tbu 7

You can use pd.factorize:

In [124]: df.apply(lambda x: pd.factorize(x)[0])
Out[124]: 
   A  B
0  0  0
1  0  1
2  1  2
3  0  1
4  2  3
5  1  0
6  2  4
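
For reference, here is a self-contained sketch of the same idea that also keeps the unique values, so the integer codes can be mapped back to the original labels. The variable names (`codes`, `uniques`, `encoded`, `decoded`, `sorted_codes`) are illustrative and not part of the original answer. The last line shows one possible alternative, the `category` dtype, which numbers labels in sorted order rather than in order of first appearance.

```python
import numpy as np
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    "A": ["bar", "bar", "flux", "bar", "foo", "flux", "foo"],
    "B": ["one", "three", "six", "three", "five", "one", "two"],
})

# pd.factorize returns (codes, uniques). Codes are assigned in order of
# first appearance, independently for each column.
codes, uniques = {}, {}
for col in df.columns:
    codes[col], uniques[col] = pd.factorize(df[col])

encoded = pd.DataFrame(codes, index=df.index)
print(encoded)

# The stored uniques allow decoding the integers back into labels.
decoded = pd.DataFrame(
    {col: np.asarray(uniques[col])[encoded[col].to_numpy()] for col in encoded.columns},
    index=df.index,
)

# Alternative (an assumption about what may suit your case): the category
# dtype assigns codes following the sorted order of the labels.
sorted_codes = df.apply(lambda s: s.astype("category").cat.codes)
print(sorted_codes)
```

Which variant is preferable depends on whether the numbering needs to be stable across datasets: order-of-appearance codes change if the rows are reordered, whereas sorted codes change only if the set of labels changes.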
