Tags: classification, random-forest, apache-spark, apache-spark-mllib
I am using RandomForest.featureImportances, but I don't understand the output.
I have 12 features, and this is the output I get.
I understand this may not be an apache-spark-specific question, but I could not find anywhere that explains the output.
// org.apache.spark.mllib.linalg.Vector = (12,[0,1,2,3,4,5,6,7,8,9,10,11],
[0.1956128039688559,0.06863606797951556,0.11302128590305296,0.091986700351889,0.03430651625283274,0.05975817050022879,0.06929766152519388,0.052654922125615934,0.06437052114945474,0.1601713590349946,0.0324327322375338,0.057751258970832206])
eliasah's answer:
Given a tree ensemble model, RandomForest.featureImportances computes the importance of each feature.
It generalizes the idea of "Gini" importance to other losses, following the explanation of Gini importance from the "Random Forests" documentation by Leo Breiman and Adele Cutler, and following the scikit-learn implementation.
For ensembles of trees, which include boosting and bagging, Hastie et al. suggest using the average of the single-tree importances across all trees in the ensemble.
This feature importance is calculated as follows:
- Average over trees:
  - importance(feature j) = sum (over nodes which split on feature j) of the gain, where the gain is scaled by the number of instances passing through the node
  - normalize the importances for the tree to sum to 1
- Normalize the feature importance vector to sum to 1.
References: Hastie, Tibshirani, Friedman. "The Elements of Statistical Learning, 2nd Edition." 2001. - 15.3.2 Variable Importance, page 593.
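In my own notation (a sketch of the steps above, not Spark's source), the same computation can be written as:

% Raw importance of feature j in tree t: impurity gain Delta i(n), weighted
% by the number of training instances N_n passing through each node n of
% tree t that splits on feature j
\mathrm{Imp}_t(j) = \sum_{n \in \mathcal{N}_t(j)} N_n \, \Delta i(n)

% Ensemble importance: per-tree importances normalized to sum to 1,
% averaged over the T trees (the result is then normalized to sum to 1 again)
\mathrm{Imp}(j) = \frac{1}{T} \sum_{t=1}^{T} \frac{\mathrm{Imp}_t(j)}{\sum_{k} \mathrm{Imp}_t(k)}

where \Delta i(n) is the impurity (e.g. Gini) gain of the split at node n.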
Let's go back to your importance vector:
import org.apache.spark.mllib.linalg.Vectors

val importanceVector = Vectors.sparse(12,
  Array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11),
  Array(0.1956128039688559, 0.06863606797951556, 0.11302128590305296,
    0.091986700351889, 0.03430651625283274, 0.05975817050022879,
    0.06929766152519388, 0.052654922125615934, 0.06437052114945474,
    0.1601713590349946, 0.0324327322375338, 0.057751258970832206))
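For reference, a Spark sparse vector prints as (size, [indices], [values]): entry k of the indices array pairs with entry k of the values array. Since all 12 indices are present here, the vector is effectively dense:

// Equivalent dense view of the same 12 weights
val denseImportances = importanceVector.toDense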
First, let's sort the features by importance:
importanceVector.toArray.zipWithIndex
  .map(_.swap)                                  // (featureIndex, importance)
  .sortBy(-_._2)                                // descending by importance
  .foreach(x => println(x._1 + " -> " + x._2))
// 0 -> 0.1956128039688559
// 9 -> 0.1601713590349946
// 2 -> 0.11302128590305296
// 3 -> 0.091986700351889
// 6 -> 0.06929766152519388
// 1 -> 0.06863606797951556
// 8 -> 0.06437052114945474
// 5 -> 0.05975817050022879
// 11 -> 0.057751258970832206
// 7 -> 0.052654922125615934
// 4 -> 0.03430651625283274
// 10 -> 0.0324327322375338
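If you would rather see column names than indices, you can zip the importances with the names in assembly order. A minimal sketch, where featureNames is a hypothetical placeholder for your real column names:

// Hypothetical names; substitute the actual columns, in the same order
// they were assembled into the feature vector
val featureNames = (0 until 12).map(i => s"feature_$i").toArray

featureNames.zip(importanceVector.toArray)
  .sortBy { case (_, imp) => -imp }             // descending by importance
  .foreach { case (name, imp) => println(f"$name%-10s -> $imp%.4f") }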
So what does this mean?
It means that your first feature (index 0) is the most important one, with a weight of ~0.19, and that your 11th feature (index 10) is the least important in your model.
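For completeness, here is a sketch of where such a vector comes from with the newer spark.ml API; the data path and the numTrees value are placeholders:

import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rf-importances").getOrCreate()

// Placeholder dataset: any DataFrame with "label" and "features" columns works
val training = spark.read.format("libsvm").load("data/sample_libsvm_data.txt")

val model = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(100)
  .fit(training)

// featureImportances is a normalized Vector: the weights sum to 1
println(model.featureImportances)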