How to build a voting classifier in sklearn when the individual classifiers are fitted on different datasets?

Question

rv1*_*123 6 machine-learning scikit-learn ensemble-learning

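I'm building a classifier on highly imbalanced data. The strategy I'm interested in testing is ensembling a model over 3 different resampled datasets. In other words, each dataset will contain all the samples of the rare class, but only n samples of the abundant class (technique #4 mentioned in this article).

I want to fit 3 different VotingClassifiers, one on each resampled dataset, and then combine the results of the individual models using another VotingClassifier (or similar). I know that building a single voting classifier looks like this: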

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from xgboost import XGBClassifier

# First Model
rnd_clf_1 = RandomForestClassifier()
xgb_clf_1 = XGBClassifier()

voting_clf_1 = VotingClassifier(
    estimators=[
        ('rf', rnd_clf_1),
        ('xgb', xgb_clf_1),
    ],
    voting='soft'
)

# And I can fit it with the first dataset this way:
voting_clf_1.fit(X_train_1, y_train_1)
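But how do I stack them if the three of them are fitted on different datasets? For example, if I have three fitted models (see the code below), I could write a function that calls .predict_proba() on each model and then "manually" averages the individual probabilities.

But... is there a better way?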

# Fitting the individual models... but how to combine the predictions?
voting_clf_1.fit(X_train_1, y_train_1)
voting_clf_2.fit(X_train_2, y_train_2)
voting_clf_3.fit(X_train_3, y_train_3)
Thanks!

Answer 1

Ven*_*lam 1

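Usually, method #4 from the article is implemented with the same type of classifier on every resample. It looks like you want to try a VotingClassifier on each sampled dataset.

This approach is already implemented in imblearn.ensemble.BalancedBaggingClassifier, which is an extension of the sklearn Bagging approach.

You can pass the VotingClassifier as the estimator, and set the number of estimators to the number of times you want to resample the dataset. Use the sampling_strategy param to specify the proportion of downsampling you want on the majority class.

Working example: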

from collections import Counter

import xgboost as xgb
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Imbalanced toy data: roughly 10% class 0 and 90% class 1
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
print('Original dataset shape %s' % Counter(y))

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=0)

rnd_clf_1 = RandomForestClassifier()
xgb_clf_1 = xgb.XGBClassifier()

voting_clf_1 = VotingClassifier(
    estimators=[
        ('rf', rnd_clf_1),
        ('xgb', xgb_clf_1),
    ],
    voting='soft'
)

# Each bagged copy of voting_clf_1 is fitted on its own balanced resample.
# n_estimators (default 10) sets how many resamples are drawn.
# Recent imbalanced-learn releases name this keyword `estimator`;
# older releases call it `base_estimator`.
bbc = BalancedBaggingClassifier(estimator=voting_clf_1, random_state=42)
bbc.fit(X_train, y_train)

y_pred = bbc.predict(X_test)
print(confusion_matrix(y_test, y_pred))


See here. Perhaps you can reuse the private VotingClassifier helpers _predict_proba() and _collect_probas() after fitting the estimators manually.
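The manual route mentioned in the question (fit each ensemble on its own resample, then average the probabilities) could look roughly like the sketch below. It is only an illustration under some assumptions: undersample, make_voter and average_proba are hypothetical helper names rather than sklearn API, LogisticRegression is swapped in for XGBClassifier so that only scikit-learn and NumPy are needed, and n_majority=100 is an arbitrary choice. It relies on public methods only (fit, predict_proba, classes_) instead of the private helpers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression


def undersample(X, y, n_majority, rng):
    """Keep every minority-class row plus n_majority random majority-class rows."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    minority_idx = np.flatnonzero(y == minority)
    majority_idx = np.flatnonzero(y != minority)
    chosen = rng.choice(majority_idx, size=n_majority, replace=False)
    idx = np.concatenate([minority_idx, chosen])
    return X[idx], y[idx]


def make_voter(seed):
    """Build one soft-voting ensemble (LogisticRegression stands in for XGBClassifier)."""
    return VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=50, random_state=seed)),
            ("lr", LogisticRegression(max_iter=1000)),
        ],
        voting="soft",
    )


def average_proba(models, X):
    """Soft-vote across already-fitted models by averaging predict_proba."""
    # Column-wise averaging is only valid if all models order the classes identically.
    ref = models[0].classes_
    assert all(np.array_equal(m.classes_, ref) for m in models)
    return np.mean([m.predict_proba(X) for m in models], axis=0)


# Toy imbalanced data: about 90% class 0 and 10% class 1 (assumed for illustration)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], flip_y=0, random_state=10)
rng = np.random.default_rng(0)

# One ensemble per resampled dataset
models = []
for seed in range(3):
    X_res, y_res = undersample(X, y, n_majority=100, rng=rng)
    models.append(make_voter(seed).fit(X_res, y_res))

proba = average_proba(models, X)
y_pred = models[0].classes_[np.argmax(proba, axis=1)]
```

Averaging the probabilities is the same rule that voting='soft' applies inside a single VotingClassifier, so this is effectively a soft vote over the three fitted ensembles. In practice you would evaluate on a held-out test set rather than on X itself.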

Archived:	6 years, 9 months ago
Views:	2407
Last activity:	5 years, 5 months ago