How do I use Spark's feature importance on Random Forest?

Question

Cli*_*der 8 scala random-forest apache-spark apache-spark-mllib

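The documentation for Random Forests does not cover feature importances. However, the feature is listed as resolved on Jira and is present in the source code. This page also says: "The main differences between this API and the original MLlib ensembles API are:

support for DataFrames and ML Pipelines
separation of classification vs. regression
use of DataFrame metadata to distinguish continuous and categorical features
a bit more functionality for random forests: estimates of feature importance, as well as the predicted probability of each class (a.k.a. class conditional probabilities) for classification."

However, I cannot figure out the syntax for calling this new feature.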

scala> model
res13: org.apache.spark.mllib.tree.model.RandomForestModel = 
TreeEnsembleModel classifier with 10 trees

scala> model.featureImportances
<console>:60: error: value featureImportances is not a member of org.apache.spark.mllib.tree.model.RandomForestModel
              model.featureImportances

Answer 1

Cli*_*der 3

You have to use the new Random Forest API. Check your imports. The old ones:

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel

The new Random Forest API uses:

import org.apache.spark.ml.classification.RandomForestClassificationModel
import org.apache.spark.ml.classification.RandomForestClassifier

This SO answer provides code to extract the importances.
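For reference, here is a minimal sketch of what calling `featureImportances` on the `spark.ml` model can look like. It is not the code from the linked answer. It assumes Spark 2.0 or later (`SparkSession` and `org.apache.spark.ml.linalg`; on 1.5.x you would use `SQLContext` and `org.apache.spark.mllib.linalg` instead), and the object name, the column names `label` and `features`, and the tiny dataset are all made up for illustration.

```scala
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object, not part of Spark.
object FeatureImportanceSketch {

  // Fits a forest with the spark.ml estimator and returns its importance vector.
  // Assumes `training` has a Double column "label" and a Vector column "features".
  def importances(training: DataFrame): Vector = {
    val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setNumTrees(10)
      .setSeed(42L)
    val model: RandomForestClassificationModel = rf.fit(training)
    // Available on the spark.ml model, unlike mllib's RandomForestModel.
    model.featureImportances
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rf-importances")
      .master("local[*]")
      .getOrCreate()

    // Made-up toy data: (label, features).
    val training = spark.createDataFrame(Seq(
      (0.0, Vectors.dense(0.0, 1.0, 0.1)),
      (1.0, Vectors.dense(1.0, 0.0, 0.2)),
      (0.0, Vectors.dense(0.1, 1.0, 0.3)),
      (1.0, Vectors.dense(0.9, 0.0, 0.1))
    )).toDF("label", "features")

    println(importances(training))
    spark.stop()
  }
}
```

The key point is that the model comes from `RandomForestClassifier.fit` on a DataFrame, not from `RandomForest.trainClassifier` on an RDD, which is why the `mllib` model in the question has no such member.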

This SO answer explains the returned sparse vector.
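As a rough illustration of reading that vector: a sparse vector prints as `(size,[indices],[values])`, and each index is the position of a feature in the assembled `features` column. The sketch below is plain Scala with no Spark dependency. `SparseImportances` is a hypothetical stand-in for the size, indices and values that a `SparseVector` exposes, and the feature names in the usage note are invented.

```scala
// Hypothetical stand-in for the fields a SparseVector exposes.
final case class SparseImportances(size: Int, indices: Array[Int], values: Array[Double])

object RankImportances {

  // Pairs every feature name with its importance (0.0 when the index is
  // absent from the sparse representation) and sorts by importance, descending.
  def rank(featureNames: Seq[String], imp: SparseImportances): Seq[(String, Double)] = {
    require(featureNames.length == imp.size, "expected one name per feature slot")
    val byIndex: Map[Int, Double] = imp.indices.zip(imp.values).toMap
    featureNames.zipWithIndex
      .map { case (name, i) => name -> byIndex.getOrElse(i, 0.0) }
      .sortBy { case (_, weight) => -weight }
  }
}
```

For example, `RankImportances.rank(Seq("age", "income", "clicks", "visits"), SparseImportances(4, Array(1, 3), Array(0.7, 0.3)))` puts `income` first and `visits` second. With a real model, the same pairing can be done by zipping your feature names, in the order they were assembled, with `model.featureImportances.toArray`.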

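Can you show us how to work with the feature importances? They are one big SparseVector and not interpretable. How do you turn them into something useful? (7 upvotes)
@Yaeli778, https://spark.apache.org/docs/1.5.2/ml-ensembles.html has a good example of how to train the model (2 upvotes)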
