Difference between varImp (caret) and importance (randomForest) for random forests

Raf*_* OR 11 r feature-selection random-forest r-caret

I do not understand what the difference is between the varImp function (caret package) and the importance function (randomForest package) for a random forest model:

I fitted a simple RF classification model, and when I computed the variable importances I found that the two functions do not rank the predictors in the same order:

Here is my code:

rfImp <- randomForest(Origin ~ ., data = TAll_CS,
                       ntree = 2000,
                       importance = TRUE)

importance(rfImp)

                                 BREAST       LUNG MeanDecreaseAccuracy MeanDecreaseGini
Energy_GLCM_R1SC4NG3        -1.44116806  2.8918537            1.0929302        0.3712622
Contrast_GLCM_R1SC4NG3      -2.61146974  1.5848150           -0.4455327        0.2446930
Entropy_GLCM_R1SC4NG3       -3.42017102  3.8839464            0.9779201        0.4170445
...

varImp(rfImp)
                                 BREAST        LUNG
Energy_GLCM_R1SC4NG3         0.72534283  0.72534283
Contrast_GLCM_R1SC4NG3      -0.51332737 -0.51332737
Entropy_GLCM_R1SC4NG3        0.23188771  0.23188771
...

I thought they used the same "algorithm", but now I am not sure.

EDIT

To reproduce the problem, the ionosphere dataset (kknn package) can be used:

library(kknn)
data(ionosphere)
rfImp <- randomForest(class ~ ., data = ionosphere[,3:35],
                       ntree = 2000,
                       importance = TRUE)
importance(rfImp)
             b        g MeanDecreaseAccuracy MeanDecreaseGini
V3  21.3106205 42.23040             42.16524        15.770711
V4  10.9819574 28.55418             29.28955         6.431929
V5  30.8473944 44.99180             46.64411        22.868543
V6  11.1880372 33.01009             33.18346         6.999027
V7  13.3511887 32.22212             32.66688        14.100210
V8  11.8883317 32.41844             33.03005         7.243705
V9  -0.5020035 19.69505             19.54399         2.501567
V10 -2.9051578 22.24136             20.91442         2.953552
V11 -3.9585608 14.68528             14.11102         1.217768
V12  0.8254453 21.17199             20.75337         3.298964
...

varImp(rfImp)
            b         g
V3  31.770511 31.770511
V4  19.768070 19.768070
V5  37.919596 37.919596
V6  22.099063 22.099063
V7  22.786656 22.786656
V8  22.153388 22.153388
V9   9.596522  9.596522
V10  9.668101  9.668101
V11  5.363359  5.363359
V12 10.998718 10.998718
...

I think I am missing something...

EDIT 2

I figured out that if you take the row-wise mean of the first two columns of importance(rfImp), you get the result of varImp(rfImp):

impRF <- importance(rfImp)[,1:2]
apply(impRF, 1, function(x) mean(x))
       V3        V4        V5        V6        V7        V8        V9 
31.770511 19.768070 37.919596 22.099063 22.786656 22.153388  9.596522 
      V10       V11       V12 
 9.668101  5.363359 10.998718     ...

# Same result as in both columns of varImp(rfImp)

I do not know why this happens, but there must be an explanation.

Sha*_*ape 16

If we step through the method for varImp:

Check the object:

> getFromNamespace('varImp','caret')
function (object, ...) 
{
    UseMethod("varImp")
}

Get the S3 method:

> getS3method('varImp','randomForest')
function (object, ...) 
{
    code <- varImpDependencies("rf")
    code$varImp(object, ...)
}
<environment: namespace:caret>


code <- caret:::varImpDependencies('rf')

> code$varImp
function(object, ...){
                    varImp <- randomForest::importance(object, ...)
                    if(object$type == "regression")
                      varImp <- data.frame(Overall = varImp[,"%IncMSE"])
                    else {
                      retainNames <- levels(object$y)
                      if(all(retainNames %in% colnames(varImp))) {
                        varImp <- varImp[, retainNames]
                      } else {
                        varImp <- data.frame(Overall = varImp[,1])
                      }
                    }

                    out <- as.data.frame(varImp)
                    if(dim(out)[2] == 2) {
                      tmp <- apply(out, 1, mean)
                      out[,1] <- out[,2] <- tmp  
                    }
                    out
                  }

So this is not strictly returning randomForest::importance;

it computes that first and then keeps only the columns for the class levels present in the dataset.

Then it does something interesting: it checks whether we have only two columns:

if(dim(out)[2] == 2) {
   tmp <- apply(out, 1, mean)
   out[,1] <- out[,2] <- tmp  
}

According to the varImp manual page:

Random Forest: varImp.randomForest and varImp.RandomForest are wrappers around the importance functions from the randomForest and party packages, respectively.

That is clearly not the case.


As for why...

If we have only two values (classes), the importance of a variable as a predictor can be represented as a single value.

If the variable is a predictor of g, then it must also be a predictor of b.

It does make sense, but it does not match their documentation of what the function does, so I would probably report it as unexpected behaviour. The function is trying to be helpful when you were expecting to do the relative calculation yourself.
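
A side note that follows from the source above (an inference from the code, not documented behaviour): because varImp.randomForest forwards its ... straight to randomForest::importance, requesting a single importance measure via type should bypass the two-column averaging, since the result then falls into the data.frame(Overall = ...) branch. If you want the overall MeanDecreaseAccuracy itself, you can of course also call importance directly:

# Overall permutation importance straight from randomForest
importance(rfImp, type = 1)                 # MeanDecreaseAccuracy, scaled
importance(rfImp, type = 1, scale = FALSE)  # unscaled version

# Inferred from the varImp source above, not from its documentation:
# with only one importance column, varImp returns it as 'Overall'
# instead of averaging per-class columns
varImp(rfImp, type = 1)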


Joh*_*nes 5

This answer is meant as a supplement to the solution by @Shape. I think importance follows Breiman's well-known approach to compute the variable importance reported as MeanDecreaseAccuracy: for the out-of-bag sample of each tree, compute the tree's accuracy, then permute the variables one after another, measure the accuracy after each permutation, and record the resulting drop in accuracy for that variable.
I could not find precise details on how the class-specific decreases in accuracy in the first columns are calculated, but I assume it is correctly predicted class k / total predicted class k.
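
To make that description concrete, here is a minimal, illustrative re-implementation of the out-of-bag permutation idea for a single variable. This is only a sketch of Breiman's measure (permImpOneVar is a made-up helper, not part of randomForest), so because of the random permutations it will only approximately reproduce importance(rf, scale = FALSE):

library(randomForest)
library(mlbench)

data(Ionosphere)
set.seed(42)
# keep.inbag = TRUE so we know which rows are out-of-bag for each tree;
# a small forest, because this naive sketch re-predicts all trees per iteration
rf <- randomForest(Class ~ ., data = Ionosphere[, 3:35], ntree = 100,
                   importance = TRUE, keep.inbag = TRUE)

# Unscaled permutation importance of one variable, averaged over trees:
# for each tree, compare its OOB accuracy before and after permuting
# that variable among the tree's OOB rows.
permImpOneVar <- function(rf, data, yname, var) {
  X <- data[, setdiff(names(data), yname), drop = FALSE]
  y <- data[[yname]]
  predAll <- predict(rf, X, predict.all = TRUE)$individual  # per-tree predictions
  drops <- numeric(rf$ntree)
  for (t in seq_len(rf$ntree)) {
    oob <- rf$inbag[, t] == 0                    # rows out-of-bag for tree t
    accBefore <- mean(predAll[oob, t] == y[oob])
    Xperm <- X
    Xperm[oob, var] <- sample(Xperm[oob, var])   # permute the variable among OOB rows
    predPerm <- predict(rf, Xperm, predict.all = TRUE)$individual[, t]
    accAfter <- mean(predPerm[oob] == y[oob])
    drops[t] <- accBefore - accAfter
  }
  mean(drops)
}

permImpOneVar(rf, Ionosphere[, 3:35], "Class", "V3")
importance(rf, scale = FALSE)["V3", "MeanDecreaseAccuracy"]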

As @Shape explains, varImp does not report the MeanDecreaseAccuracy given by importance; instead it computes the mean of the (scaled) class-specific decreases in accuracy and reports that value for each class. (For more than 2 classes, varImp simply reports the class-specific decreases in accuracy.)
The two measures are similar only when the class distribution is balanced, because only in the balanced case does a decrease in accuracy for one class lower the accuracy for the other class by a comparable amount, as the code below demonstrates.

library(caret)
library(randomForest)
library(mlbench)

### Balanced sample size ###
data(Ionosphere)
rfImp1 <- randomForest(Class ~ ., data = Ionosphere[,3:35], ntree = 1000, importance = TRUE)

# How importance() calculates the overall decrease in accuracy for the variable
Imp1 <- importance(rfImp1, scale = FALSE)
summary(Ionosphere$Class)/nrow(Ionosphere)
classRatio1 <- summary(Ionosphere$Class)/nrow(Ionosphere)
#      bad      good 
#0.3589744 0.6410256 

# Caret calculates a simple mean
varImp(rfImp1, scale = FALSE)["V3",] # 0.04542253
Imp1["V3", "bad"] * 0.5 + Imp1["V3", "good"] * 0.5 # 0.04542253
# importance is closer to the weighted average of class importances
Imp1["V3", ] # 0.05262225  
Imp1["V3", "bad"] * classRatio1[1] + Imp1["V3", "good"] * classRatio1[2] # 0.05274091

### Equal sample size ###
Ionosphere2 <- Ionosphere[c(which(Ionosphere$Class == "good"), sample(which(Ionosphere$Class == "bad"), 225, replace = TRUE)),]
summary(Ionosphere2$Class)/nrow(Ionosphere2)
classRatio2 <- summary(Ionosphere2$Class)/nrow(Ionosphere2)
#  bad good 
# 0.5  0.5

rfImp2 <- randomForest(Class ~ ., data = Ionosphere2[,3:35], ntree = 1000, importance = TRUE)
Imp2 <- importance(rfImp2, scale = FALSE)

# Caret calculates a simple mean
varImp(rfImp2, scale = FALSE)["V3",] # 0.06126641 
Imp2["V3", "bad"] * 0.5 + Imp2["V3", "good"] * 0.5 # 0.06126641 
# As does the average adjusted for the balanced class ratio
Imp2["V3", "bad"] * classRatio2[1] + Imp2["V3", "good"] * classRatio2[2] # 0.06126641 
# There is now not much difference between the measure for balanced classes
Imp2["V3",] # 0.06106229

I believe this can be interpreted as caret giving equal weight to all classes, whereas importance reports a variable as more important if it is important for the more common class. I tend to agree with Max Kuhn, but the difference should be explained somewhere in the documentation.
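
To make that comparison concrete for all variables at once, here is a small helper (compareImportance is just a hypothetical name, not part of either package) that puts caret's equal-weight mean next to a class-frequency-weighted mean and the unscaled MeanDecreaseAccuracy column:

# Hypothetical helper: equal-weight vs class-frequency-weighted averages
compareImportance <- function(rf, data, yname) {
  imp <- randomForest::importance(rf, scale = FALSE)
  classes <- levels(data[[yname]])
  w <- as.numeric(table(data[[yname]])[classes]) / nrow(data)  # class frequencies
  data.frame(
    equal_weight = rowMeans(imp[, classes]),          # what varImp(rf, scale = FALSE) reports
    freq_weight  = as.vector(imp[, classes] %*% w),   # weighted by the class ratio
    MeanDecreaseAccuracy = imp[, "MeanDecreaseAccuracy"]
  )
}

head(compareImportance(rfImp1, Ionosphere, "Class"))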