Encoding column labels in Pandas for machine learning

pbu*_*pbu 3 python machine-learning pandas scikit-learn

I am working with the car evaluation dataset for machine learning. The dataset looks like this:

buying,maint,doors,persons,lug_boot,safety,class
vhigh,vhigh,2,2,small,low,unacc
vhigh,vhigh,2,2,small,med,unacc
vhigh,vhigh,2,2,small,high,unacc
vhigh,vhigh,2,2,med,low,unacc
vhigh,vhigh,2,2,med,med,unacc
vhigh,vhigh,2,2,med,high,unacc

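I want to convert these strings, column by column, into unique enumerated integers. I see that pandas.factorize() is the way to go, but it only works on a single column. How can I factorize the whole DataFrame in one go with a single command?

I tried a lambda function, but it does not work.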

df.apply(lambda c:pd.factorize(c),axis = 1)

Output:

0    ([0, 0, 1, 1, 2, 3, 4], [vhigh, 2, small, low,...
1    ([0, 0, 1, 1, 2, 3, 4], [vhigh, 2, small, med,...
2    ([0, 0, 1, 1, 2, 3, 4], [vhigh, 2, small, high...
3    ([0, 0, 1, 1, 2, 3, 4], [vhigh, 2, med, low, u...
4      ([0, 0, 1, 1, 2, 2, 3], [vhigh, 2, med, unacc])
5    ([0, 0, 1, 1, 2, 3, 4], [vhigh, 2, med, high, ...

I can see the encoded values, but I cannot pull them out of the arrays above.

Tom*_*ger 6

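factorize returns a tuple of (values, labels), which the pandas documentation calls (codes, uniques). You only want the values in the DataFrame, so apply it per column (the default axis) and keep element [0].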

In [26]: cols = ['buying', 'maint', 'lug_boot', 'safety', 'class']

In [27]: df[cols].apply(lambda x: pd.factorize(x)[0])
Out[27]: 
   buying  maint  lug_boot  safety  class
0       0      0         0       0      0
1       0      0         0       1      0
2       0      0         0       2      0
3       0      0         1       0      0
4       0      0         1       1      0
5       0      0         1       2      0

Then concatenate that with the numeric data.
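As a rough sketch of that step, the snippet below factorizes the string columns, keeps the uniques so the codes can be decoded later, and joins the result back onto the remaining columns. The DataFrame is rebuilt from the six sample rows in the question; the names `factorized`, `encoded`, `uniques` and `result` are illustrative, and treating `doors` and `persons` as already numeric is an assumption based only on those sample rows.

```python
import pandas as pd

# Six sample rows copied from the question (assumption: doors/persons are numeric)
df = pd.DataFrame({
    "buying": ["vhigh"] * 6,
    "maint": ["vhigh"] * 6,
    "doors": [2] * 6,
    "persons": [2] * 6,
    "lug_boot": ["small", "small", "small", "med", "med", "med"],
    "safety": ["low", "med", "high", "low", "med", "high"],
    "class": ["unacc"] * 6,
})

cols = ["buying", "maint", "lug_boot", "safety", "class"]

# pd.factorize returns (codes, uniques) for each column
factorized = {col: pd.factorize(df[col]) for col in cols}

# Integer codes go into a new DataFrame aligned with the original index
encoded = pd.DataFrame(
    {col: codes for col, (codes, _) in factorized.items()},
    index=df.index,
)

# Keep the uniques so codes can be mapped back to labels later
uniques = {col: labels for col, (_, labels) in factorized.items()}

# Join the encoded columns to the untouched ones, restoring the column order
result = pd.concat([df.drop(columns=cols), encoded], axis=1)[df.columns]

# Decoding: look up each code in the stored uniques
decoded_safety = uniques["safety"].take(encoded["safety"].to_numpy()).tolist()
```

Codes are assigned in order of first appearance, so `safety` here becomes low=0, med=1, high=2 only because the rows happen to appear in that order.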

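One caveat, though: integer codes impose an ordering and a spacing on the categories. Here it implies that "low" safety and "high" safety are the same distance from "med" safety. You might be better off using pd.get_dummies: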

In [37]: dummies = []

In [38]: for col in cols:
   ....:     dummies.append(pd.get_dummies(df[col]))
   ....:     

In [39]: pd.concat(dummies, axis=1)
Out[39]: 
   vhigh  vhigh  med  small  high  low  med  unacc
0      1      1    0      1     0    1    0      1
1      1      1    0      1     0    0    1      1
2      1      1    0      1     1    0    0      1
3      1      1    1      0     0    1    0      1
4      1      1    1      0     0    0    1      1
5      1      1    1      0     1    0    0      1

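get_dummies has some optional parameters to control the naming, which you will probably want: in the output above, columns such as vhigh and med appear twice because each dummy is named only after its value.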
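For instance, a minimal sketch using the `columns`, `prefix_sep` and `dtype` parameters of `pd.get_dummies` might look like the following. The DataFrame is again rebuilt from the sample rows in the question, and `dtype=int` is chosen here only so the output shows 0/1 regardless of pandas version.

```python
import pandas as pd

# Six sample rows copied from the question
df = pd.DataFrame({
    "buying": ["vhigh"] * 6,
    "maint": ["vhigh"] * 6,
    "doors": [2] * 6,
    "persons": [2] * 6,
    "lug_boot": ["small", "small", "small", "med", "med", "med"],
    "safety": ["low", "med", "high", "low", "med", "high"],
    "class": ["unacc"] * 6,
})

cols = ["buying", "maint", "lug_boot", "safety", "class"]

# Passing the whole DataFrame with `columns=` encodes only those columns,
# prefixes each dummy with its source column name, and keeps the rest as-is.
dummies = pd.get_dummies(df, columns=cols, prefix_sep="_", dtype=int)

print(dummies.columns.tolist())
```

With the prefix applied, the dummy columns get names such as `safety_low` and `lug_boot_med`, so the duplicate names disappear and no explicit loop or `pd.concat` is needed.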