Posted by aMK*_*MKa

Best way to create an ML pipeline in Apache Spark for a dataset with a large number of columns

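I am working with Spark 2.1.1 on a dataset with ~2000 features and am trying to create a basic ML Pipeline consisting of some Transformers and a Classifier.

For simplicity, let's assume the Pipeline I am working with consists of a VectorAssembler, a StringIndexer and a Classifier, which should be a fairly common use case.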

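import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.ml.tuning.ParamGridBuilder

// Pipeline elements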
val assmbleFeatures: VectorAssembler = new VectorAssembler()
  .setInputCols(featureColumns)
  .setOutputCol("featuresRaw")

val labelIndexer: StringIndexer = new StringIndexer()
  .setInputCol("TARGET")
  .setOutputCol("indexedLabel")

// Train a RandomForest model.
val rf: RandomForestClassifier = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("featuresRaw")
  .setMaxBins(30)

// add the params, unique to this classifier
val paramGrid = new ParamGridBuilder()
  .addGrid(rf.numTrees, Array(5))
  .addGrid(rf.maxDepth, Array(5))
  .build()

// Treat the Pipeline as an Estimator, to jointly choose parameters for all Pipeline stages.
val evaluator = new BinaryClassificationEvaluator()
  .setMetricName("areaUnderROC")
  .setLabelCol("indexedLabel")


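If the pipeline steps are split into a transformer pipeline (VectorAssembler + StringIndexer) and a second classifier pipeline, and the unnecessary columns are dropped in between the two pipelines, training succeeds. This means that to reuse the models, two PipelineModels have to be saved after training, and an intermediate preprocessing step has to be introduced.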
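
The two-pipeline workaround described above could look roughly like the sketch below. It is an illustration under assumptions, not the exact code from the post: the helper name `fitInTwoStages` and the `train` DataFrame are hypothetical, while the column names and stage settings are taken from the snippet earlier in the question.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.DataFrame

// Hypothetical helper: fits the transformer pipeline and the classifier
// pipeline separately and returns both fitted models.
def fitInTwoStages(
    train: DataFrame,
    featureColumns: Array[String]): (PipelineModel, PipelineModel) = {

  // Pipeline 1: transformers only (VectorAssembler + StringIndexer).
  val assembler = new VectorAssembler()
    .setInputCols(featureColumns)
    .setOutputCol("featuresRaw")
  val labelIndexer = new StringIndexer()
    .setInputCol("TARGET")
    .setOutputCol("indexedLabel")
  val preprocessing: PipelineModel = new Pipeline()
    .setStages(Array[PipelineStage](assembler, labelIndexer))
    .fit(train)

  // Intermediate step: keep only the two columns the classifier reads,
  // which drops the raw feature columns from the DataFrame.
  val slim: DataFrame = preprocessing
    .transform(train)
    .select("featuresRaw", "indexedLabel")

  // Pipeline 2: classifier only, trained on the slimmed-down DataFrame.
  val rf = new RandomForestClassifier()
    .setLabelCol("indexedLabel")
    .setFeaturesCol("featuresRaw")
    .setMaxBins(30)
  val classification: PipelineModel = new Pipeline()
    .setStages(Array[PipelineStage](rf))
    .fit(slim)

  (preprocessing, classification)
}
```

Both returned models would then need to be persisted separately (for example with `write.overwrite().save(...)` on each `PipelineModel`, to paths of your choosing), and the same `select` has to be repeated between them at scoring time, which is the extra bookkeeping the question is trying to avoid.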