Cross-validation + decision trees in sklearn

raz*_*113 9 machine-learning decision-tree cross-validation

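I am trying to build a decision tree with cross-validation using sklearn and pandas.

My question concerns the code below: the cross-validation step splits the data, and I then use the split for both training and testing. I try to find the best depth for the tree by rebuilding it n times with different max depth settings. When using cross-validation, should I be using k-fold CV, and if so, how would I use it in my code?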

import numpy as np
import pandas as pd
from sklearn import tree
# sklearn.cross_validation was removed in 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split

features = ["fLength", "fWidth", "fSize", "fConc", "fConc1", "fAsym", "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]

df = pd.read_csv('magic04.data', header=None, names=features)

# Encode the class labels: gamma ('g') -> 0, hadron ('h') -> 1
df['class'] = df['class'].map({'g': 0, 'h': 1})

x = df[features[:-1]]
y = df['class']

# A single, fixed 60/40 train/test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=0)

depth = []
for i in range(3, 20):
    clf = tree.DecisionTreeClassifier(max_depth=i)
    clf = clf.fit(x_train, y_train)
    depth.append((i, clf.score(x_test, y_test)))
print(depth)

Here is a link to the data I am using, in case anyone needs it: https://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope

Dim*_*nis 21

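In your code you are creating a static train/test split. If you want to select the best depth by cross-validation, you can use `cross_val_score` inside the for loop. It lived in `sklearn.cross_validation` in older releases and has been in `sklearn.model_selection` since 0.18.

You can read sklearn's documentation for more information.

Here is your code updated to use CV: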

import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in 0.20
from pprint import pprint

features = ["fLength", "fWidth", "fSize", "fConc", "fConc1", "fAsym", "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]

df = pd.read_csv('magic04.data',header=None,names=features)
df['class'] = df['class'].map({'g':0,'h':1})

x = df[features[:-1]]
y = df['class']

# x_train,x_test,y_train,y_test = cross_validation.train_test_split(x,y,test_size=0.4,random_state=0)
depth = []
for i in range(3,20):
    clf = tree.DecisionTreeClassifier(max_depth=i)
    # Perform 7-fold cross validation 
    scores = cross_val_score(estimator=clf, X=x, y=y, cv=7, n_jobs=4)
    depth.append((i,scores.mean()))
print(depth)
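
On the "should I use k-fold" part of the question: when `cv` is an integer and the estimator is a classifier, `cross_val_score` already runs a stratified k-fold split. The sketch below only makes that splitter explicit so the folds can be shuffled with a fixed seed. It assumes a synthetic dataset from `make_classification` as a stand-in for `magic04.data`, and the names `mean_scores` and `best_depth` are illustrative, not from the original code.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in for magic04.data (10 numeric features, binary target)
X, y = make_classification(n_samples=600, n_features=10, n_informative=5, random_state=0)

# Explicit 7-fold stratified splitter; shuffling with a fixed seed keeps folds reproducible
cv = StratifiedKFold(n_splits=7, shuffle=True, random_state=0)

mean_scores = {}
for max_depth in range(3, 20):
    clf = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    # One accuracy value per fold; average them to get one number per depth
    scores = cross_val_score(clf, X, y, cv=cv)
    mean_scores[max_depth] = scores.mean()

# Depth with the highest mean cross-validated accuracy
best_depth = max(mean_scores, key=mean_scores.get)
print(best_depth, mean_scores[best_depth])
```

Reusing the same splitter object for every depth means each candidate is scored on identical folds, which makes the comparison between depths fairer than re-splitting at random each time.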

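Alternatively, you can use `sklearn.model_selection.GridSearchCV` instead of writing the for loop yourself, especially if you want to optimize several hyperparameters.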

import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.model_selection import GridSearchCV

features = ["fLength", "fWidth", "fSize", "fConc", "fConc1", "fAsym", "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]

df = pd.read_csv('magic04.data',header=None,names=features)
df['class'] = df['class'].map({'g':0,'h':1})

x = df[features[:-1]]
y = df['class']


parameters = {'max_depth':range(3,20)}
clf = GridSearchCV(tree.DecisionTreeClassifier(), parameters, n_jobs=4)
clf.fit(X=x, y=y)
tree_model = clf.best_estimator_
print(clf.best_score_, clf.best_params_)
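
To illustrate the "several hyperparameters" case, here is a minimal sketch that searches over `max_depth` and `min_samples_leaf` together. The second parameter and its candidate values are my own choice for illustration, and the data is again a synthetic stand-in from `make_classification` rather than `magic04.data`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in for magic04.data
X, y = make_classification(n_samples=600, n_features=10, n_informative=5, random_state=0)

# Every combination of these values is evaluated (5 depths x 3 leaf sizes = 15 candidates)
param_grid = {
    "max_depth": list(range(3, 20, 4)),
    "min_samples_leaf": [1, 5, 20],  # hypothetical values, chosen only for the example
}

search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

After fitting, `search.cv_results_` holds the per-candidate scores, which is handy if you want to see how close the runners-up were instead of only the winner.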

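Edit: changed how GridSearchCV is imported, following learn2day's comment.

  • `grid_search` has been deprecated since 0.18 and was removed in 0.20. You should now use `GridSearchCV` from `sklearn.model_selection` (5 upvotes)
  • @Rookie_123 If you choose to use cross-validation to tune the model's hyperparameters, it is best to do a train/test split first, train and cross-validate on the training set, and finally test on the test set you set aside at the start. `sklearn.model_selection.train_test_split` is convenient for the train/test split (3 upvotes)
  • +1 for answering the question and also suggesting grid search, which is definitely the better practice for this kind of problem (2 upvotes)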
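
The workflow described in the comment above (split first, cross-validate on the training part only, test once at the end) could be sketched roughly as follows. This assumes synthetic data from `make_classification` in place of `magic04.data`; the 60/40 split mirrors the question, and `test_accuracy` is just an illustrative name.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in for magic04.data
X, y = make_classification(n_samples=600, n_features=10, n_informative=5, random_state=0)

# 1. Set aside a test set that the hyperparameter search never sees
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y
)

# 2. Tune max_depth with 5-fold cross-validation on the training set only
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": list(range(3, 20))},
    cv=5,
)
search.fit(X_train, y_train)

# 3. Evaluate the refitted best model once on the held-out test set
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, search.best_score_, test_accuracy)
```

`best_score_` here is the mean cross-validated accuracy on the training data, while `test_accuracy` comes from data the search never touched, so the second number is the less optimistic estimate.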