I am using recursive feature elimination with cross-validation (RFECV) as the feature selector for a RandomForestClassifier, as follows.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

X = df[my_features]       # all my features (my_features is a list of column names)
y = df['gold_standard']   # labels

clf = RandomForestClassifier(random_state=42, class_weight="balanced")
rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(10), scoring='roc_auc')
rfecv.fit(X, y)

print("Optimal number of features : %d" % rfecv.n_features_)
features = list(X.columns[rfecv.support_])
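For reference, here is a minimal self-contained version of the selection step above; `make_classification` is just a stand-in for my real `df` / `gold_standard` data, and the smaller forest is only to keep it quick.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for df[my_features] / df['gold_standard']
X_arr, y = make_classification(n_samples=300, n_features=10,
                               n_informative=4, random_state=42)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])

clf = RandomForestClassifier(n_estimators=50, random_state=42,
                             class_weight="balanced")
rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(10), scoring='roc_auc')
rfecv.fit(X, y)

print("Optimal number of features : %d" % rfecv.n_features_)
selected = list(X.columns[rfecv.support_])
print(selected)
```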
I also run GridSearchCV as follows, to tune the hyperparameters of the RandomForestClassifier.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X = df[my_features]       # all my features (my_features is a list of column names)
y = df['gold_standard']   # labels

x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

rfc = RandomForestClassifier(random_state=42, class_weight='balanced')
param_grid = {
    'n_estimators': [200, 500],
    'max_features': …
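Since the grid above got cut off, here is a runnable sketch of the tuning step on synthetic data; the `max_features` values are placeholders of mine, not the original grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for df[my_features] / df['gold_standard']
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

rfc = RandomForestClassifier(random_state=42, class_weight='balanced')
param_grid = {
    'n_estimators': [50, 100],           # placeholder values
    'max_features': ['sqrt', 'log2'],    # placeholder values
}

grid = GridSearchCV(rfc, param_grid=param_grid, cv=5)
grid.fit(x_train, y_train)

print(grid.best_params_)
print(grid.score(x_test, y_test))  # accuracy on the held-out split
```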
I am trying to solve a regression problem on the Boston dataset with a random forest regressor, and I am using GridSearchCV to select the best hyperparameters.
Question 1
Should I fit GridSearchCV on some X_train, y_train and then get the best parameters,
or
should I fit it on X, y to get the best parameters? (X, y = the whole dataset)
Question 2
Say I fit it on X, y, get the best parameters, and then build a new model with those best parameters. How should I train this new model?
Should I train the new model on X_train, y_train or on X, y?
Question 3
If I train the new model on X, y, how do I then validate the results?
My code so far:
#Dataframes
feature_cols = ['CRIM','ZN','INDUS','NOX','RM','AGE','DIS','TAX','PTRATIO','B','LSTAT']
X = boston_data[feature_cols]
y = boston_data['PRICE']
Train/test split:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1)
Grid search for the best hyperparameters:
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
param_grid = {
'n_estimators': [100, 500, 1000, 1500],
'max_depth' : [4,5,6,7,8,9,10]
}
CV_rfc = …
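To make questions 1 and 2 concrete, here is a sketch of the two options on a synthetic regression problem (`make_regression` stands in for the Boston data, and the grid values are smaller placeholders so it runs quickly):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the Boston features / PRICE target
X, y = make_regression(n_samples=200, n_features=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

param_grid = {'n_estimators': [50, 100], 'max_depth': [4, 6]}  # placeholder grid

# Option A: tune on the training split only, then evaluate the refit
# best estimator on the untouched test split.
gs_a = GridSearchCV(RandomForestRegressor(random_state=1), param_grid, cv=3)
gs_a.fit(X_train, y_train)
test_r2 = gs_a.score(X_test, y_test)  # R^2 on held-out data

# Option B: tune on all of X, y — but then no untouched data remains
# for an unbiased final evaluation (which is what question 3 asks about).
gs_b = GridSearchCV(RandomForestRegressor(random_state=1), param_grid, cv=3)
gs_b.fit(X, y)

print(gs_a.best_params_, gs_b.best_params_, test_r2)
```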