pbu*_*pbu 3 python machine-learning pandas scikit-learn
I am working on the Car Evaluation dataset for machine learning. The dataset looks like this:
buying,maint,doors,persons,lug_boot,safety,class
vhigh,vhigh,2,2,small,low,unacc
vhigh,vhigh,2,2,small,med,unacc
vhigh,vhigh,2,2,small,high,unacc
vhigh,vhigh,2,2,med,low,unacc
vhigh,vhigh,2,2,med,med,unacc
vhigh,vhigh,2,2,med,high,unacc
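I want to convert these strings to unique enumerated integers, column by column. I see that pandas.factorize() is the way to go, but it only works on one column. How can I factorize the whole DataFrame in one go with a single command?
I tried a lambda function, but it does not work as expected.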
df.apply(lambda c:pd.factorize(c),axis = 1)
Output:
0 ([0, 0, 1, 1, 2, 3, 4], [vhigh, 2, small, low,...
1 ([0, 0, 1, 1, 2, 3, 4], [vhigh, 2, small, med,...
2 ([0, 0, 1, 1, 2, 3, 4], [vhigh, 2, small, high...
3 ([0, 0, 1, 1, 2, 3, 4], [vhigh, 2, med, low, u...
4 ([0, 0, 1, 1, 2, 2, 3], [vhigh, 2, med, unacc])
5 ([0, 0, 1, 1, 2, 3, 4], [vhigh, 2, med, high, ...
I can see the encoded values, but I cannot pull them out of the arrays above.
factorize returns a tuple of (values, labels). You only want the values in the DataFrame.
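To make the tuple concrete, here is a minimal sketch on a small made-up Series (the `safety` values below are illustrative, not read from the real file). It unpacks the two parts and shows that the second part can map the integer codes back to the original strings.

```python
import pandas as pd

# Hypothetical sample column, for illustration only.
safety = pd.Series(["low", "med", "high", "low"])

# factorize returns two objects: integer codes and the unique values.
codes, uniques = pd.factorize(safety)

# Codes are assigned in order of first appearance:
# "low" -> 0, "med" -> 1, "high" -> 2.
print(codes)
print(uniques)

# The unique values can be used to recover the original strings.
decoded = uniques.take(codes)
print(list(decoded))
```

Indexing the result with `[0]` keeps only the codes.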
In [26]: cols = ['buying', 'maint', 'lug_boot', 'safety', 'class']
In [27]: df[cols].apply(lambda x: pd.factorize(x)[0])
Out[27]:
buying maint lug_boot safety class
0 0 0 0 0 0
1 0 0 0 1 0
2 0 0 0 2 0
3 0 0 1 0 0
4 0 0 1 1 0
5 0 0 1 2 0
Then concatenate that with the numeric data.
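A minimal sketch of that step, assuming `doors` and `persons` are already numeric as in the sample rows shown in the question (in the full dataset they may contain strings, in which case they would need encoding too). The small DataFrame below is made up for illustration.

```python
import pandas as pd

# Hypothetical rows in the same shape as the question's data.
df = pd.DataFrame({
    "buying": ["vhigh", "vhigh", "high", "med"],
    "maint": ["vhigh", "med", "med", "low"],
    "doors": [2, 2, 4, 4],
    "persons": [2, 4, 4, 2],
    "lug_boot": ["small", "med", "big", "small"],
    "safety": ["low", "med", "high", "low"],
    "class": ["unacc", "unacc", "acc", "unacc"],
})

cols = ["buying", "maint", "lug_boot", "safety", "class"]

# Factorize each string column and keep only the integer codes.
encoded = df[cols].apply(lambda x: pd.factorize(x)[0])

# Columns that were not encoded (assumed numeric here).
numeric = df.drop(columns=cols)

# Join the two pieces side by side; the row index is shared.
result = pd.concat([numeric, encoded], axis=1)
print(result)
```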
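A word of caution, though: this implies that "low" safety and "high" safety are the same distance from "med" safety. You might be better off using pd.get_dummies: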
In [37]: dummies = []
In [38]: for col in cols:
....: dummies.append(pd.get_dummies(df[col]))
....:
In [39]: pd.concat(dummies, axis=1)
Out[39]:
vhigh vhigh med small high low med unacc
0 1 1 0 1 0 1 0 1
1 1 1 0 1 0 0 1 1
2 1 1 0 1 1 0 0 1
3 1 1 1 0 0 1 0 1
4 1 1 1 0 0 0 1 1
5 1 1 1 0 1 0 0 1
get_dummies has some optional parameters to control the naming, which you will probably want.
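As one possible sketch of those naming options: passing the whole DataFrame with `columns=` makes get_dummies prefix each indicator column with its source column name, which avoids ambiguous headers such as two columns both called `vhigh`. The data below is made up for illustration.

```python
import pandas as pd

# Hypothetical rows, for illustration only.
df = pd.DataFrame({
    "buying": ["vhigh", "vhigh", "high"],
    "safety": ["low", "med", "high"],
    "doors": [2, 2, 4],
})

# Encode only the listed columns; other columns pass through unchanged.
# By default each new column is named "<column>_<value>".
dummies = pd.get_dummies(df, columns=["buying", "safety"])
print(dummies.columns.tolist())
```

The `prefix` and `prefix_sep` arguments can change the prefix text and the separator if a different naming scheme is preferred.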