sklearn Random Forest: .oob_score_ too low?

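The RandomForest "inner" logic works closely with a RANDOM-PROCESS, through which the samples (DataSET X) with a known y == { labels (for a classifier) | targets (for a regressor) } get split throughout the forest generation: each tree is bootstrapped from a random draw of the DataSET, so there is a part the tree can see and a part the tree will not see (thus forming an inner oob-subSET).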
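The seen / unseen split can be illustrated with plain NumPy. This is a minimal sketch, assuming the usual bootstrap (n draws with replacement from n samples); the helper name oob_mask is made up for illustration and is not part of sklearn.

```python
import numpy as np

def oob_mask(n_samples, rng):
    """Boolean mask of the samples that one bootstrap draw did NOT pick."""
    drawn = rng.integers(0, n_samples, size=n_samples)  # draw with replacement
    mask = np.ones(n_samples, dtype=bool)
    mask[drawn] = False                                 # anything drawn is "in-bag"
    return mask

rng = np.random.default_rng(0)
n_samples = 10_000

# Repeat the draw for a handful of hypothetical trees and average the unseen share.
fractions = [oob_mask(n_samples, rng).mean() for _ in range(20)]
mean_oob_fraction = float(np.mean(fractions))
print(f"mean out-of-bag fraction: {mean_oob_fraction:.3f}")  # theory: about 1/e
```

Each tree therefore leaves roughly a third of the DataSET untouched, and that unseen part is what the out-of-bag score is computed on.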

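Besides other effects on sensitivity to overfitting and the like, a Random Forest ensemble arguably does not need a separate cross-validation: the out-of-bag estimate already plays that role, and adding more trees does not by itself make the ensemble overfit (individual, fully grown trees still can, so this is not an absolute guarantee). Many papers, as well as Breiman's (Berkeley) empirical results, support this, providing evidence that a CV-ed predictor ends up with about the same score as .oob_score_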
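One way to check that claim on your own data is to put both numbers side by side. The sketch below uses a synthetic dataset from make_regression purely as a stand-in (an assumption, not the asker's data); expect the two scores to be close, not identical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; replace with your own X, y.
X, y = make_regression(n_samples=500, n_features=10, n_informative=5,
                       noise=10.0, random_state=0)

# Out-of-bag R^2, computed during a single fit on the whole dataset.
forest = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)
oob_r2 = forest.oob_score_

# 5-fold cross-validated R^2 of the same configuration.
cv_r2 = cross_val_score(
    RandomForestRegressor(n_estimators=100, random_state=0),
    X, y, cv=5, scoring="r2",
).mean()

print(f"OOB R^2: {oob_r2:.3f}   5-fold CV R^2: {cv_r2:.3f}")
```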

import sklearn.ensemble
aRF_PREDICTOR = sklearn.ensemble.RandomForestRegressor( n_estimators                = 10,              # The number of trees in the forest.
                                                        criterion                   = 'squared_error', # { Regressor: 'squared_error' (named 'mse' before scikit-learn 1.0) | Classifier: 'gini' }
                                                        max_depth                   = None,
                                                        min_samples_split           = 2,
                                                        min_samples_leaf            = 1,
                                                        min_weight_fraction_leaf    = 0.0,
                                                        max_features                = 1.0,             # all features; older releases spelled this 'auto' for regressors
                                                        max_leaf_nodes              = None,
                                                        bootstrap                   = True,
                                                        oob_score                   = False,           # SET True to get inner-CrossValidation-alike .oob_score_ attribute calculated right during Training-phase on the whole DataSET
                                                        n_jobs                      = 1,               # { 1 | n-cores | -1 == all-cores }
                                                        random_state                = None,
                                                        verbose                     = 0,
                                                        warm_start                  = False
                                                        )
# Attributes, available only after .fit() (the oob_* ones only with oob_score = True):
# aRF_PREDICTOR.estimators_                           # aList of <DecisionTreeRegressor>  The collection of fitted sub-estimators.
# aRF_PREDICTOR.feature_importances_                  # array of shape = [n_features]     The feature importances (the higher, the more important the feature).
# aRF_PREDICTOR.oob_score_                            # float                             Score of the training dataset obtained using an out-of-bag estimate.
# aRF_PREDICTOR.oob_prediction_                       # array of shape = [n_samples]      Prediction computed with out-of-bag estimate on the training set.
#
# Methods, in the documentation's signature notation ([...] marks optional arguments):
# aRF_PREDICTOR.apply(         X )                      # Apply trees in the forest to X, return leaf indices.
# aRF_PREDICTOR.fit(           X, y[, sample_weight] )  # Build a forest of trees from the training set (X, y).
# aRF_PREDICTOR.fit_transform( X[, y] )                 # Fit to data, then transform it. (older releases only)
# aRF_PREDICTOR.get_params(          [deep] )           # Get parameters for this estimator.
# aRF_PREDICTOR.predict(       X )                      # Predict regression target for X.
# aRF_PREDICTOR.score(         X, y[, sample_weight] )  # Returns the coefficient of determination R^2 of the prediction.
# aRF_PREDICTOR.set_params(          **params )         # Set the parameters of this estimator.
# aRF_PREDICTOR.transform(     X[, threshold] )         # Reduce X to its most important features. (older releases only; later replaced by sklearn.feature_selection.SelectFromModel)


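Be aware, too, that the default values are not the best choice, let alone a good one under all circumstances. Before going any further, look at the problem domain so as to propose a reasonable parameterisation of the ensemble.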

Q: What is a good .oob_score_ ?

A: .oob_score_ is RANDOM! ... Yes, it must be (random).

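While this may sound like a provocative epilogue, do not give up hope. A Random Forest ensemble is a great tool. Categorical values in the features (DataSET X) may cause some problems, but the cost of handling the ensemble is still reasonable once you do not have to fight bias or overfitting. That's great, isn't it?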

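Because you need to be able to reproduce the same results on subsequent re-runs, it is recommended to (re-)seed numpy.random and to use .set_params( random_state = ... ) to put the RANDOM-PROCESS (embedded in every bootstrapping of the RandomForest ensemble) into a known state before it starts. Doing so, one can observe a "de-noised" progression of the RandomForest-based predictor towards a better .oob_score_ that is due to truly improved predictive power, introduced by more ensemble members (n_estimators) or by less constrained tree construction (max_depth, max_leaf_nodes et al.), and not merely due to "better luck" in the RANDOM-PROCESS of how the DataSET gets split.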
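A minimal sketch of that state control, assuming a synthetic three-class dataset from make_classification as a stand-in; fit_oob is a hypothetical helper, not an sklearn function.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; replace with your own X, y.
X, y = make_classification(n_samples=400, n_features=12, n_informative=6,
                           n_classes=3, random_state=1)

def fit_oob(seed):
    """Fit a fresh forest from a known random state and return its OOB score."""
    forest = RandomForestClassifier(n_estimators=50, oob_score=True)
    forest.set_params(random_state=seed)   # pin the RANDOM-PROCESS
    forest.fit(X, y)
    return forest.oob_score_

same_a = fit_oob(42)
same_b = fit_oob(42)   # same seed: the bootstraps and splits repeat exactly
other = fit_oob(7)     # different seed: the score moves by "luck" alone

print(f"seed 42: {same_a:.4f}  seed 42 again: {same_b:.4f}  seed 7: {other:.4f}")
```

Whatever gap shows up between the two seeds is the noise floor: a change in .oob_score_ smaller than that says nothing about the parameters you changed.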

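Getting closer to a better solution typically means putting more trees into the ensemble (RandomForest decisions are based on a majority vote, or on averaging for a regressor, so 10 estimators are not much of a basis for good decisions on a highly complex DataSET). Numbers above 2000 are not uncommon. One may iterate over a range of ensemble sizes (with the RANDOM-PROCESS kept under full state control) to demonstrate the ensemble's "improvements".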
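Such a sweep over ensemble sizes could look like the sketch below. The dataset is again a synthetic stand-in and the sizes are arbitrary illustration values; the fixed random_state is what keeps the comparison fair.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic four-class stand-in data; replace with your own X, y.
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)

oob_by_size = {}
for n_trees in (25, 50, 100, 200, 400):
    forest = RandomForestClassifier(n_estimators=n_trees,
                                    oob_score=True,
                                    random_state=0)   # same known state each run
    forest.fit(X, y)
    oob_by_size[n_trees] = forest.oob_score_

for n_trees, score in oob_by_size.items():
    print(f"n_estimators = {n_trees:4d}   oob_score_ = {score:.4f}")
```

With very small forests some samples may never end up out-of-bag, and sklearn then warns that the OOB estimate is unreliable, which is one more reason not to judge a model on 10 trees.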

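If the initial .oob_score_ falls somewhere around 0.51 - 0.53, your ensemble is only 1% - 3% better than a RANDOM-GUESS on a balanced two-class problem (with k balanced classes a random guess scores about 1/k, so the baseline has to be adjusted accordingly).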
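Before reading anything into a classifier's .oob_score_, it helps to compute the chance level for the actual label distribution. The sketch below uses only the standard library; chance_levels is a hypothetical helper and the balanced seven-class labels are made-up illustration data. For a regressor, .oob_score_ is R^2 instead, where 0 corresponds to always predicting the mean.

```python
from collections import Counter

def chance_levels(labels):
    """Return (uniform-guess accuracy, majority-class accuracy) for the labels."""
    counts = Counter(labels)
    n = len(labels)
    uniform = 1.0 / len(counts)          # pick any class with equal probability
    majority = max(counts.values()) / n  # always answer the most frequent class
    return uniform, majority

# Made-up balanced seven-class labels, 100 samples per class.
labels = [i % 7 for i in range(700)]
uniform, majority = chance_levels(labels)
print(f"uniform guess: {uniform:.3f}   majority class: {majority:.3f}")
```

An accuracy-type .oob_score_ only means something relative to the larger of these two numbers.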

Only after you have made the ensemble-based predictor better should you move on to additional tricks such as feature engineering.

aRF_PREDICTOR.oob_score_    Out[79]: 0.638801  # n_estimators =   10
aRF_PREDICTOR.oob_score_    Out[89]: 0.789612  # n_estimators =  100


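*"If .oob_score_ ~0.51 - 0.53 your ensemble is 1% - 3% better than a random guess"* This is not correct: he said it is a seven-class classification problem. A random guess would score ~0.14. An oob_score of 0.02 is far worse than random. (3 upvotes)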

Archived: 11 years, 5 months ago
Views: 5363
Last activity: 6 years, 10 months ago