Jas*_*hez 9 python pipeline machine-learning scikit-learn
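My goal is to use one model to select the most important variables and another model to make predictions with those variables. In the example below I use two RandomForestClassifiers, but the second model could be any other classifier.
RF has a transform method with a threshold argument. I would like to grid search over the different possible threshold values.
Here is a simplified code snippet: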
# Transform object and classifier
rf_filter = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42, oob_score=False)
clf = RandomForestClassifier(n_jobs=-1, random_state=42, oob_score=False)
pipe = Pipeline([("RFF", rf_filter), ("RF", clf)])
# Grid search parameters
rf_n_estimators = [10, 20]
rff_transform = ["median", "mean"] # Search the threshold parameters
estimator = GridSearchCV(pipe,
                         cv = 3,
                         param_grid = dict(RF__n_estimators = rf_n_estimators,
                                           RFF__threshold = rff_transform))
estimator.fit(X_train, y_train)
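The error is: ValueError: Invalid parameter threshold for estimator RandomForestClassifier
I thought this would work because the documentation says:
If None and if available, the object attribute threshold is used.
I tried setting the threshold attribute before the grid search (rf_filter.threshold = "median") and it worked; however, I can't figure out how to grid search over it.
Is there a way to iterate over different values of arguments that are normally passed to a classifier's transform method?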
小智 9
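I ran into the same problem with the approach you describe, that is, using two different random forest classifiers grouped in a pipeline for feature selection and classification.
Instances of the RandomForestClassifier class have no attribute named threshold. You can indeed add one manually, either the way you described or with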
setattr(object, 'threshold', 'mean')
But the main problem seems to be the way the get_params method determines the valid parameters of any BaseEstimator subclass:
class BaseEstimator(object):
    """Base class for all estimators in scikit-learn

    Notes
    -----
    All estimators should specify all the parameters that can be set
    at the class level in their __init__ as explicit keyword
    arguments (no *args, **kwargs).
    """
    @classmethod
    def _get_param_names(cls):
        """Get parameter names for the estimator"""
        try:
            # fetch the constructor or the original constructor before
            # deprecation wrapping if any
            init = getattr(cls.__init__, 'deprecated_original', cls.__init__)
            # introspect the constructor arguments to find the model parameters
            # to represent
            args, varargs, kw, default = inspect.getargspec(init)
            if not varargs is None:
                raise RuntimeError("scikit-learn estimators should always "
                                   "specify their parameters in the signature"
                                   " of their __init__ (no varargs)."
                                   " %s doesn't follow this convention."
                                   % (cls, ))
            # Remove 'self'
            # XXX: This is going to fail if the init is a staticmethod, but
            # who would do this?
            args.pop(0)
        except TypeError:
            # No explicit __init__
            args = []
        args.sort()
        return args
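Indeed, as the docstring states explicitly, all estimators should declare every parameter that can be set at the class level as an explicit keyword argument of their __init__. An attribute attached afterwards, such as threshold, is therefore not recognized as a parameter, and the grid search rejects it.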
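To make that rule concrete, here is a minimal standard-library sketch of the same introspection idea. It is not scikit-learn's code: `get_param_names`, `ToyForest` and `ToyForestWithThreshold` are hypothetical names for illustration, and `inspect.signature` is used because `inspect.getargspec` is no longer available in recent Python versions.

```python
import inspect


def get_param_names(cls):
    """Return the sorted explicit argument names of cls.__init__, minus self."""
    signature = inspect.signature(cls.__init__)
    names = []
    for name, param in signature.parameters.items():
        if name == "self":
            continue
        if param.kind is inspect.Parameter.VAR_POSITIONAL:
            # Mirrors the convention quoted above: no *args allowed
            raise RuntimeError("parameters must be explicit (no *args)")
        if param.kind is inspect.Parameter.VAR_KEYWORD:
            # **kwargs carries no named parameter, so nothing to report
            continue
        names.append(name)
    return sorted(names)


class ToyForest:
    """Hypothetical stand-in for an estimator without a threshold parameter."""

    def __init__(self, n_estimators=10, random_state=None):
        self.n_estimators = n_estimators
        self.random_state = random_state


class ToyForestWithThreshold(ToyForest):
    """Same toy estimator, with threshold declared in __init__."""

    def __init__(self, n_estimators=10, random_state=None, threshold="mean"):
        super().__init__(n_estimators=n_estimators, random_state=random_state)
        self.threshold = threshold


forest = ToyForest()
forest.threshold = "median"  # attribute attached after construction

# Only constructor arguments count as parameters; the attribute set above
# does not change what introspection of the class reports.
print(get_param_names(type(forest)))
print(get_param_names(ToyForestWithThreshold))
```

The attribute set on the instance is usable by the object itself, but a parameter search that validates names against the constructor signature has no way of knowing about it.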
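So I tried declaring threshold as an argument of the __init__ function, with a default value of 'mean' (which is the default in the current implementation anyway)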
def __init__(self,
             n_estimators=10,
             criterion="gini",
             max_depth=None,
             min_samples_split=2,
             min_samples_leaf=1,
             max_features="auto",
             bootstrap=True,
             oob_score=False,
             n_jobs=1,
             random_state=None,
             verbose=0,
             min_density=None,
             compute_importances=None,
             threshold="mean"): # ADD THIS!
and then assigning the value of this argument to an attribute of the instance.
self.threshold = threshold # ADD THIS LINE SOMEWHERE IN THE FUNCTION __INIT__
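Of course, this means modifying the RandomForestClassifier class (in /python2.7/site-packages/sklearn/ensemble/forest.py), which is probably not the best approach... but it worked for me! I am now able to grid search (and cross-validate) over different threshold values and thus over different numbers of selected features.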
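As an alternative to editing the installed library, here is a sketch that assumes a newer scikit-learn release than the one discussed here, in which `sklearn.feature_selection.SelectFromModel` and `sklearn.model_selection.GridSearchCV` are available. `SelectFromModel` takes `threshold` as a constructor argument, so it can be addressed from the grid as `RFF__threshold`. The digits dataset and the smaller forest sizes are choices made for this illustration only.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)

# The selector wraps the first forest; threshold is one of its own
# constructor parameters, so set_params accepts it.
rf_filter = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=42)
)
clf = RandomForestClassifier(random_state=42)
pipe = Pipeline([("RFF", rf_filter), ("RF", clf)])

# Same grid as in the question: step name, double underscore, parameter name
param_grid = {
    "RFF__threshold": ["median", "mean"],
    "RF__n_estimators": [10, 20],
}
search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
```

Because the selector is refit inside every cross-validation fold, the feature selection is tuned without leaking information from the held-out fold.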
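from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier

class my_rf_filter(BaseEstimator, TransformerMixin):
    def __init__(self, threshold):
        # threshold is an explicit __init__ argument, so get_params() exposes it
        self.threshold = threshold

    def fit(self, X, y):
        model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42, oob_score=False)
        model.fit(X, y)
        self.model = model
        return self

    def transform(self, X):
        # Delegates to the forest's own transform(X, threshold) method
        return self.model.transform(X, self.threshold)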
Wrapping RandomForestClassifier in a new class, as above, makes this work.
rf_filter = my_rf_filter(threshold='mean')
clf = RandomForestClassifier(n_jobs=-1, random_state=42, oob_score=False)
pipe = Pipeline([("RFF", rf_filter), ("RF", clf)])
# Grid search parameters
rf_n_estimators = [10, 20]
rff_transform = ["median", "mean"] # Search the threshold parameters
estimator = GridSearchCV(pipe,
                         cv = 3,
                         param_grid = dict(RF__n_estimators = rf_n_estimators,
                                           RFF__threshold = rff_transform))
A test example:
from sklearn import datasets
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
estimator.fit(X_digits, y_digits)
Out[143]:
GridSearchCV(cv=3,
       estimator=Pipeline(steps=[('RFF', my_rf_filter(threshold='mean')), ('RF', RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=10, n_jobs=-1,
            oob_score=False, random_state=42, verbose=0))]),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid={'RF__n_estimators': [10, 20], 'RFF__threshold': ['median', 'mean']},
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
       verbose=0)
estimator.grid_scores_
Out[144]:
[mean: 0.89705, std: 0.00912, params: {'RF__n_estimators': 10, 'RFF__threshold': 'median'},
mean: 0.91597, std: 0.00871, params: {'RF__n_estimators': 20, 'RFF__threshold': 'median'},
mean: 0.89705, std: 0.00912, params: {'RF__n_estimators': 10, 'RFF__threshold': 'mean'},
mean: 0.91597, std: 0.00871, params: {'RF__n_estimators': 20, 'RFF__threshold': 'mean'}]
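If you need to modify parameters of RandomForestClassifier inside the my_rf_filter class, I think you need to add them explicitly, i.e. not by using **kwargs in __init__() together with model.set_params(**kwargs), because that did not work for me. I think adding n_estimators=200 to __init__() and then setting model.n_estimators = self.n_estimators would work.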
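The suggestion above could look roughly like the following sketch. It is an illustration rather than the answerer's code: `RFImportanceFilter` is a hypothetical name, the forest settings are exposed as explicit `__init__` arguments instead of `**kwargs`, and the feature mask is computed directly from `feature_importances_`, on the assumption that the forest's own `transform` method may not be available in the installed scikit-learn version.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier


class RFImportanceFilter(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper: keep features whose importance reaches a cutoff."""

    def __init__(self, threshold="mean", n_estimators=200, random_state=42):
        # Every tunable value is an explicit keyword argument stored
        # unchanged, so get_params/set_params (and a grid search) can see it.
        self.threshold = threshold
        self.n_estimators = n_estimators
        self.random_state = random_state

    def fit(self, X, y):
        self.model_ = RandomForestClassifier(
            n_estimators=self.n_estimators, random_state=self.random_state
        )
        self.model_.fit(X, y)
        importances = self.model_.feature_importances_
        if self.threshold == "mean":
            cutoff = np.mean(importances)
        elif self.threshold == "median":
            cutoff = np.median(importances)
        else:
            # Any other value is treated as a numeric cutoff
            cutoff = float(self.threshold)
        # Boolean mask of the features to keep
        self.support_ = importances >= cutoff
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.support_]


X, y = load_digits(return_X_y=True)
rf_filter = RFImportanceFilter(threshold="median", n_estimators=20)
X_reduced = rf_filter.fit_transform(X, y)

print(sorted(rf_filter.get_params()))
print(X.shape, "->", X_reduced.shape)
```

With this layout, a pipeline step named "RFF" could be tuned through both `RFF__threshold` and `RFF__n_estimators`, since both names appear in the constructor signature.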