Tag: grid-search

Getting the precision of a recall-optimized model from GridSearchCV

Given an RBF SVC machine-learning model called "m", I ran GridSearchCV over gamma values to optimize recall. I want to answer this question: "The grid search should find the model that best optimizes recall. How much better is that model's recall than its precision?"

So I ran the GridSearchCV:

grid_values = {'gamma': [0.001, 0.01, 0.05, 0.1, 1, 10, 100]}
grid_m_re = GridSearchCV(m, param_grid = grid_values, scoring = 'recall')
grid_m_re.fit(X_train, y_train)
y_decision_fn_scores_re = grid_m_re.decision_function(X_test) 

print('Grid best parameter (max. recall): ', grid_m_re.best_params_)
print('Grid best score (recall): ', grid_m_re.best_score_)

This tells me the best model has gamma=0.001, with a recall score of 1.

I'd like to know how to get the precision of this model, to see its trade-off, since GridSearchCV only has attributes for the metric it was told to optimize. ([sklearn.GridSearchCV docs][1])
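One way to do this: best_estimator_ is refit on the whole training set after the search, so any metric can be applied to its test-set predictions. A minimal sketch on synthetic data (the dataset and split here are made up; variable names mirror the question):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

m = SVC(kernel='rbf')
grid_values = {'gamma': [0.001, 0.01, 0.05, 0.1, 1, 10, 100]}
grid_m_re = GridSearchCV(m, param_grid=grid_values, scoring='recall')
grid_m_re.fit(X_train, y_train)

# best_estimator_ is retrained on all of X_train, so we can score it
# with any metric we like, not just the one the search optimized
y_pred = grid_m_re.best_estimator_.predict(X_test)
print('recall   :', recall_score(y_test, y_pred))
print('precision:', precision_score(y_test, y_pred))
```

Since grid_m_re itself delegates predict to the best estimator, grid_m_re.predict(X_test) would give the same predictions.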

python scikit-learn grid-search

1 vote · 1 answer · 8,590 views

Nested cross-validation with cross_val_score, a pipeline and GridSearchCV

I'm using scikit-learn and trying to tune XGBoost. I tried nested cross-validation: a pipeline rescales the training folds (to avoid data leakage and overfitting), GridSearchCV does the parameter tuning in parallel, and cross_val_score reports the roc_auc score.

from imblearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier


std_scaling = StandardScaler()
algo = XGBClassifier()

steps = [('std_scaling', std_scaling), ('algo', algo)]

pipeline = Pipeline(steps)

parameters = {'algo__min_child_weight': [1, 2],
              'algo__subsample': [0.6, 0.9],
              'algo__max_depth': [4, 6],
              'algo__gamma': [0.1, 0.2],
              'algo__learning_rate': [0.05, 0.5, 0.3]}

cv1 = RepeatedKFold(n_splits=2, n_repeats = 5, random_state = 15)

clf_auc = GridSearchCV(pipeline, cv = cv1, param_grid = parameters, scoring = 'roc_auc', n_jobs=-1, return_train_score=False)

cv1 = RepeatedKFold(n_splits=2, …
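The outer loop of a nested scheme like this can be written very compactly, because GridSearchCV is itself an estimator: passing it to cross_val_score re-runs the whole tuning procedure on every outer training fold. A self-contained sketch on synthetic data, with XGBClassifier swapped for LogisticRegression so it runs without xgboost (the structure is identical):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=15)

# scaling lives inside the pipeline, so each fold is scaled
# using only its own training data (no leakage)
pipe = Pipeline([('std_scaling', StandardScaler()),
                 ('algo', LogisticRegression())])
parameters = {'algo__C': [0.1, 1.0]}

inner_cv = RepeatedKFold(n_splits=2, n_repeats=2, random_state=15)
outer_cv = RepeatedKFold(n_splits=2, n_repeats=2, random_state=16)

clf_auc = GridSearchCV(pipe, param_grid=parameters, cv=inner_cv,
                       scoring='roc_auc')
# the outer CV clones and refits the entire search on each outer fold
scores = cross_val_score(clf_auc, X, y, cv=outer_cv, scoring='roc_auc')
print(scores.mean())
```

Using two different RepeatedKFold objects for the inner and outer loops keeps the two levels of splitting independent.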

pipeline nested scikit-learn cross-validation grid-search

1 vote · 1 answer · 1,105 views

GridSearchCV with lightgbm requires an estimator with a fit() method?

I'm trying to use GridSearchCV with a LightGBM sklearn estimator, but I'm having trouble building the search.

The code I'm trying to build looks like this:

d_train = lgb.Dataset(X_train, label=y_train)
params = {}
params['learning_rate'] = 0.003
params['boosting_type'] = 'gbdt'
params['objective'] = 'binary'
params['metric'] = 'binary_logloss'
params['sub_feature'] = 0.5
params['num_leaves'] = 10
params['min_data'] = 50
params['max_depth'] = 10

clf = lgb.train(params, d_train, 100)

param_grid = {
    'num_leaves': [10, 31, 127],
    'boosting_type': ['gbdt', 'rf'],
    'learning rate': [0.1, 0.001, 0.003]
    }


gsearch = GridSearchCV(estimator=clf, param_grid=param_grid)
lgb_model = gsearch.fit(X=train, y=y)

But I get the following error:

TypeError: estimator should be an estimator implementing 'fit' method, 
          <lightgbm.basic.Booster object at 0x0000014C55CA2880> …

python scikit-learn grid-search lightgbm

1 vote · 1 answer · 1,585 views

Scikit: how to check whether an object is a RandomizedSearchCV or a RandomForestClassifier?

I have some classifiers that were created with grid search, and others created directly as random forests.

A random forest has type sklearn.ensemble.forest.RandomForestClassifier, while one created via grid search has type sklearn.grid_search.RandomizedSearchCV.

I'm trying to check an estimator's type programmatically (to decide whether I need best_estimator_ to use feature importances), but I can't seem to find a good way to do it.

if type(estimator) == 'sklearn.grid_search.RandomizedSearchCV' was my first guess, but that's clearly wrong.
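Comparing type() against a string never matches; isinstance with the class object is the usual approach, and a fitted search exposes the underlying forest as best_estimator_. A small sketch using the modern sklearn.model_selection import (the old sklearn.grid_search module has since been removed):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=100, random_state=0)

def importances(est):
    # isinstance checks against the class object, not the class name string
    if isinstance(est, RandomizedSearchCV):
        est = est.best_estimator_
    return est.feature_importances_

rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            {'n_estimators': [5, 10]}, n_iter=2,
                            random_state=0).fit(X, y)

# both kinds of object now go through the same helper
print(len(importances(rf)), len(importances(search)))
```

isinstance also handles subclasses, which a direct type() comparison would not.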

types python-2.7 random-forest scikit-learn grid-search

0 votes · 1 answer · 1,633 views

ValueError: continuous is not supported

I'm using GridSearchCV for cross-validation of a linear regression (not a classifier, and not logistic regression).

I also use StandardScaler to standardize X.

My data frame has 17 features (X) and 5 targets (y), with about 1150 rows.

I keep getting the error message ValueError: continuous is not supported, and I'm running out of options.

Here is some of the code (assume all imports are done correctly):

soilM = pd.read_csv('C:/training.csv', index_col=0)
soilM = getDummiedSoilDepth(soilM) #transform text values in 0 and 1

soilM = soilM.drop('Depth', 1) 

soil = soilM.iloc[:,-22:]

X_train, X_test, Ca_train, Ca_test, P_train, P_test, pH_train, pH_test, SOC_train, SOC_test, Sand_train, Sand_test = splitTrainTestAdv(soil)

scores = ['precision', 'recall']


for score in scores:

    for model in MODELS.keys():

        print(model, score)

        performParameterSelection(model, score, X_test, Ca_test, X_train, Ca_train)

def performParameterSelection(model_name, criteria, X_test, y_test, X_train, y_train):

    model, param_grid = MODELS[model_name]
    gs = GridSearchCV(model, param_grid, n_jobs= 1, cv=5, verbose=1, …
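'precision' and 'recall' are classification scorers, and handing them continuous regression targets is exactly what raises "continuous is not supported". For a linear regression the grid search needs a regression scorer such as 'r2' or 'neg_mean_squared_error'. A minimal sketch on synthetic data (the soil CSV from the question is not reproduced here):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# continuous targets, like the Ca/P/pH/SOC/Sand columns in the question
X, y = make_regression(n_samples=100, n_features=17, random_state=0)

# a regression scorer avoids the "continuous is not supported" error
gs = GridSearchCV(LinearRegression(),
                  {'fit_intercept': [True, False]},
                  cv=5, scoring='r2')
gs.fit(X, y)
print(gs.best_score_)
```

Each of the 5 targets would be tuned in a separate search like this (or the model wrapped in MultiOutputRegressor).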

python linear-regression pandas scikit-learn grid-search

0 votes · 1 answer · 10,000 views

"Parallel" pipeline to get the best model using grid search

In sklearn it's possible to define a serial pipeline so that the best combination of hyperparameters is found across all consecutive parts of the pipeline. A serial pipeline can be implemented as follows:

from sklearn.svm import SVC
from sklearn import decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

digits = datasets.load_digits()
X_train = digits.data
y_train = digits.target

#Use Principal Component Analysis to reduce dimensionality
# and improve generalization
pca = decomposition.PCA()
# Use a linear SVC
svm = SVC()
# Combine PCA and SVC to a pipeline
pipe = Pipeline(steps=[('pca', pca), ('svm', svm)])
# Check the training time for the SVC
n_components = [20, 40, 64]
params_grid = {
'svm__C': …
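A "parallel" arrangement can be expressed with FeatureUnion: each branch transforms the same input side by side, the outputs are concatenated, and GridSearchCV tunes hyperparameters in every branch through the usual double-underscore names. A sketch on a subset of the digits data (the SelectKBest branch is an illustrative assumption, not something from the question):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

digits = load_digits()
X, y = digits.data[:400], digits.target[:400]

# both branches see the same input; their outputs are concatenated
union = FeatureUnion([('pca', PCA()), ('kbest', SelectKBest(k=10))])
pipe = Pipeline(steps=[('features', union), ('svm', SVC())])

params_grid = {
    'features__pca__n_components': [20, 40],
    'svm__C': [1, 10],
}
grid = GridSearchCV(pipe, params_grid, cv=3)
grid.fit(X, y)
print(grid.best_params_)
```

The nested names (features__pca__n_components) show how the grid reaches into an individual branch of the union.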

python machine-learning scikit-learn grid-search

0 votes · 1 answer · 1,885 views

SVR hyperparameter selection and visualization

I'm just a beginner in data analysis. I want to use the cross-validation grid-search method to determine the parameters gamma and C of a radial basis function (RBF) kernel SVM. I don't know where I should put my data in this code, or what kind of data I should use (training or target data)?

For SVR:

import numpy as np
import pandas as pd
from math import sqrt
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error,explained_variance_score
from TwoStageTrAdaBoostR2 import TwoStageTrAdaBoostR2 # import the two-stage algorithm
from sklearn import preprocessing
from sklearn import svm
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from matplotlib.colors import Normalize
from sklearn.svm import SVC

# Data import (source)
source= pd.read_csv(sourcedata) …

data-visualization svm data-analysis scikit-learn grid-search

0 votes · 1 answer · 4,395 views

Custom transformers and GridSearch: ValueError in a pipeline

I'm trying to tune hyperparameters in a scikit-learn pipeline that uses some custom transformers, but I keep running into an error:

import numpy as np

from sklearn.model_selection import TimeSeriesSplit
from sklearn.model_selection import GridSearchCV
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class RollingMeanTransform(BaseEstimator, TransformerMixin):

    def __init__(self, col, window=3):
        self._window = window
        self._col = col

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        df = X.copy()
        df['{}_rolling_mean'.format(self._col)] = df[self._col].shift(1).rolling(self._window).mean().fillna(0.0)
        return df


class TimeEncoding(BaseEstimator, TransformerMixin):

    def __init__(self, col, drop_original=True):
        self._col = col 
        self._drop_original = drop_original

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        unique_vals = float(len(X[self._col].unique()))
        X['sin_{}'.format(self._col)] = np.sin(2 * …
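One common cause of errors like this: GridSearchCV clones every estimator in the pipeline, and BaseEstimator.get_params looks attributes up by the exact names of the __init__ arguments, so storing them as self._col and self._window instead of self.col and self.window breaks cloning. A sketch of the transformer with that convention fixed (the 'price' column is an illustrative assumption):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone

class RollingMeanTransform(BaseEstimator, TransformerMixin):
    # store every __init__ argument under its own name: get_params()
    # reads self.col and self.window, so self._col would break cloning
    def __init__(self, col, window=3):
        self.col = col
        self.window = window

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        df = X.copy()
        df['{}_rolling_mean'.format(self.col)] = (
            df[self.col].shift(1).rolling(self.window).mean().fillna(0.0))
        return df

t = RollingMeanTransform('price', window=2)
df = pd.DataFrame({'price': [1.0, 2.0, 3.0, 4.0]})
print(clone(t).get_params())  # cloning now round-trips the parameters
print(t.fit_transform(df)['price_rolling_mean'].tolist())
```

The same rename would apply to TimeEncoding's _col and _drop_original attributes.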

python scikit-learn grid-search

-1 votes · 1 answer · 472 views