While working with DecisionTreeClassifier I visualized the tree using graphviz, and to my astonishment it seems to take categorical data and use it as continuous data.
All my features are categorical; for example, you can see the following tree (please note that the first feature, X[0], has 6 possible values: 0, 1, 2, 3, 4, 5):
From what I found here, the classifier uses a tree class which is a binary tree, so this is a limitation of sklearn.
Does anyone know a way I am missing to use the tree categorically? (I know it is not the best fit for the task, but since I need categories, I am currently using one-hot vectors on the data.)
EDIT: a sample of the original data looks like this:
f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 c1 c2 c3
0 C S O 1 2 1 1 2 1 2 0 0 0
1 D S O 1 3 1 1 2 1 2 0 0 0
2 C S O 1 3 1 1 2 1 1 0 0 0
3 D S O 1 3 1 1 2 1 2 0 0 0
4 D A O 1 3 1 1 2 1 2 0 0 0
5 D A O 1 2 1 1 2 1 2 0 0 0
6 D A O 1 2 1 1 2 1 1 0 0 0
7 D A O 1 2 1 1 2 1 2 0 0 0
8 D K O 1 3 1 1 2 1 2 0 0 0
9 C R O 1 3 1 1 2 1 1 0 0 0
where X[0] = f1, and I encoded the strings as integers because sklearn does not accept strings.
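A minimal sketch of the integer encoding the question describes, assuming toy data in the same shape as the sample above (column names and labels are illustrative, not from the original):

```python
# Sketch: encode string columns to integers, then fit a decision tree.
# This is what the question describes; note that the integer codes
# implicitly impose an order on the categories.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "f1": [0, 1, 2, 3],
    "f2": ["C", "D", "C", "D"],
    "f3": ["S", "S", "A", "A"],
})
y = [0, 0, 1, 1]  # hypothetical target

# Encode each string column to integers (alphabetical class order).
for col in ["f2", "f3"]:
    df[col] = LabelEncoder().fit_transform(df[col])

clf = DecisionTreeClassifier(random_state=0).fit(df, y)
```

The tree then splits on these integer codes with thresholds like `X[1] <= 0.5`, treating the categories as if they were ordered numbers.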
Well, I was surprised too, but it turns out that sklearn's decision trees indeed cannot handle categorical data. There has been a GitHub issue (#4899) open since June 2015, and it is still open (I suggest a quick skim of that thread, as some of the comments are quite interesting).
The problem with encoding categorical variables as integers, as you have done here, is that it imposes an order on them, which may or may not be meaningful depending on the situation. For example, you could encode ['low', 'medium', 'high'] as [0, 1, 2], since 'low' < 'medium' < 'high' (we call such categorical variables ordinal), but you would still be implicitly making the additional (and possibly undesired) assumption that the distance between 'low' and 'medium' is the same as the distance between 'medium' and 'high' (this has no effect on decision trees, but it matters, e.g., in k-NN and clustering). This approach fails completely in cases like ['red', 'green', 'blue'] or ['male', 'female'], since we cannot claim any meaningful relative order between them.
So, for non-ordinal (nominal) categorical variables, the correct way to encode them for use with sklearn decision trees is the OneHotEncoder module. The "Encoding categorical features" section of the user guide might also be helpful.
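The ordinal case above can be sketched with a plain mapping (the dictionary and sample values are illustrative):

```python
# Sketch: mapping an ordinal variable to integers preserves the order
# 'low' < 'medium' < 'high', but also imposes equal spacing between
# adjacent levels, which may not be intended.
order = {"low": 0, "medium": 1, "high": 2}
values = ["low", "high", "medium", "low"]
encoded = [order[v] for v in values]
# For a nominal variable like ['red', 'green', 'blue'] there is no
# meaningful order, so any such integer mapping is arbitrary.
```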
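A minimal example of the OneHotEncoder approach, using an illustrative nominal column (not the original data):

```python
# Sketch: one-hot encode a nominal feature so the tree sees one
# binary column per category instead of an arbitrary integer order.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["red"], ["green"], ["blue"], ["red"]])

enc = OneHotEncoder()
# .toarray() converts the sparse result to a dense matrix;
# categories are ordered alphabetically: blue, green, red.
X_onehot = enc.fit_transform(X).toarray()
```

Each row now has exactly one 1 in the column for its category, so a decision tree can split on "is red / is not red" rather than on a threshold over arbitrary codes.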