I am using Spark 2.1.1 on a dataset with ~2000 features and trying to create a basic ML pipeline consisting of some transformers and a classifier.
Let's assume, for the sake of simplicity, that the pipeline I am working with consists of a VectorAssembler, a StringIndexer and a Classifier, which would be a fairly common use case.
// Pipeline elements
val assmbleFeatures: VectorAssembler = new VectorAssembler()
.setInputCols(featureColumns)
.setOutputCol("featuresRaw")
val labelIndexer: StringIndexer = new StringIndexer()
.setInputCol("TARGET")
.setOutputCol("indexedLabel")
// Train a RandomForest model.
val rf: RandomForestClassifier = new RandomForestClassifier()
.setLabelCol("indexedLabel")
.setFeaturesCol("featuresRaw")
.setMaxBins(30)
// add the params, unique to this classifier
val paramGrid = new ParamGridBuilder()
.addGrid(rf.numTrees, Array(5))
.addGrid(rf.maxDepth, Array(5))
.build()
// Treat the Pipeline as an Estimator, to jointly choose parameters for all Pipeline stages.
val evaluator = new BinaryClassificationEvaluator()
.setMetricName("areaUnderROC")
.setLabelCol("indexedLabel")
If the pipeline steps are separated into a transformer pipeline (VectorAssembler + StringIndexer) and a second classifier pipeline, and if the unnecessary columns are dropped in between both pipelines, training succeeds. This means that for reusing the model, both PipelineModels have to be saved after training and an intermediate preprocessing step has to be introduced.
// Split indexers and forest in two Pipelines.
val prePipeline = new Pipeline().setStages(Array(labelIndexer, assmbleFeatures)).fit(dfTrain)
// Transform data and drop all columns, except those needed for training
val dfTrainT = prePipeline.transform(dfTrain)
val columnsToDrop = dfTrainT.columns.filter(col => !Array("featuresRaw", "indexedLabel").contains(col))
val dfTrainRdy = dfTrainT.drop(columnsToDrop:_*)
val mainPipeline = new Pipeline().setStages(Array(rf))
val cv = new CrossValidator()
.setEstimator(mainPipeline)
.setEvaluator(evaluator)
.setEstimatorParamMaps(paramGrid)
.setNumFolds(2)
val bestModel = cv.fit(dfTrainRdy).bestModel.asInstanceOf[PipelineModel]
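For completeness, reusing this split setup at prediction time means persisting and loading both fitted models; a minimal sketch (the paths and dfNew are placeholders, PipelineModel.save/load are the standard ML persistence APIs):
import org.apache.spark.ml.PipelineModel
// Persist both fitted models after training
prePipeline.write.overwrite().save("/models/prePipeline")
bestModel.write.overwrite().save("/models/rfPipeline")
// At prediction time, load both models and repeat the intermediate column drop on the new data
val pre = PipelineModel.load("/models/prePipeline")
val forest = PipelineModel.load("/models/rfPipeline")
val dfNewT = pre.transform(dfNew)
val dfNewRdy = dfNewT.drop(dfNewT.columns.filter(c => !Array("featuresRaw", "indexedLabel").contains(c)):_*)
val predictions = forest.transform(dfNewRdy)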
A (imho) much cleaner solution would be to merge all pipeline stages into one pipeline.
val pipeline = new Pipeline()
.setStages(Array(labelIndexer, assmbleFeatures, rf))
val cv = new CrossValidator()
.setEstimator(pipeline)
.setEvaluator(evaluator)
.setEstimatorParamMaps(paramGrid)
.setNumFolds(2)
// This will fail!
val bestModel = cv.fit(dfTrain).bestModel.asInstanceOf[PipelineModel]
However, putting all PipelineStages into one Pipeline leads to the following exception, probably due to the issue this PR will eventually solve:
ERROR CodeGenerator: failed to compile: org.codehaus.janino.JaninoRuntimeException: Constant pool for class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection has grown past JVM limit of 0xFFFF
The reason for this is that the VectorAssembler effectively doubles (in this example) the amount of data in the DataFrame, as there is no transformer that could drop the unnecessary columns. (See spark pipeline vector assembler drop other columns.)
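Worth noting that a column drop can in principle be expressed as a pipeline stage via the built-in SQLTransformer, sketched below (the statement and the placement are only illustrative); however, as described further down, a column-dropping stage by itself still hit the same exception for me.
import org.apache.spark.ml.feature.SQLTransformer
// Keep only the assembled features and the indexed label; __THIS__ refers to the stage's input DataFrame
val dropCols = new SQLTransformer()
  .setStatement("SELECT featuresRaw, indexedLabel FROM __THIS__")
// It would sit between the assembler and the classifier:
// new Pipeline().setStages(Array(labelIndexer, assmbleFeatures, dropCols, rf))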
The example works on the golub dataset, and the following preprocessing steps are necessary:
import org.apache.spark.sql.types.DoubleType
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature._
import org.apache.spark.sql._
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
val df = spark.read.option("header", true).option("inferSchema", true).csv("/path/to/dataset/golub_merged.csv").drop("_c0").repartition(100)
// Those steps are necessary, otherwise training would fail either way
val colsToDrop = df.columns.take(5000)
val dfValid = df.withColumn("TARGET", df("TARGET_REAL").cast(DoubleType)).drop("TARGET_REAL").drop(colsToDrop:_*)
// Split df in train and test sets
val Array(dfTrain, dfTest) = dfValid.randomSplit(Array(0.7, 0.3))
// Feature columns are columns except "TARGET"
val featureColumns = dfTrain.columns.filter(col => col != "TARGET")
As I am new to Spark, I am not sure what the best way to solve this problem would be. Would you suggest...
Or am I missing anything important (pipeline steps, the PR, etc.) that would solve this problem?
I implemented a new Transformer DroppingVectorAssembler, which drops unnecessary columns, however the same exception is thrown.
Besides that, setting spark.sql.codegen.wholeStage to false does not solve the issue.
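(For reference, the whole-stage codegen switch was set like this; spark.conf.set is the standard way to change a SQL config at runtime:)
// Disable whole-stage code generation for the current session
spark.conf.set("spark.sql.codegen.wholeStage", "false")
// or when the session is created:
// SparkSession.builder.config("spark.sql.codegen.wholeStage", "false").getOrCreate()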
The janino error is due to the number of constant variables created during the optimizer process. The maximum number of constant variables allowed in the JVM is ((2^16) - 1). If this limit is exceeded, you get the Constant pool for class ... has grown past JVM limit of 0xFFFF error.
The JIRA that will fix this issue is SPARK-18016, but it is still in progress at this time.
Your code is most likely failing during the VectorAssembler stage, when it has to perform against thousands of columns during a single optimization task.
The workaround that I developed for this problem is to create a "vector of vectors" by working against subsets of the columns and then bringing the results together at the end to create a singular feature vector. This prevents any single optimization task from exceeding the JVM constant limit. It isn't elegant, but I've used it on datasets reaching into the 10k column range.
This method also allows you to still keep a single pipeline, although it requires some additional steps to make it work (creating the sub-vectors). After creating the feature vector from the sub-vectors, you can drop the original source columns if desired.
Example code:
// IMPORT DEPENDENCIES
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{SQLContext, Row, DataFrame, Column}
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.{Pipeline, PipelineModel}
// Create first example dataframe
val exampleDF = spark.createDataFrame(Seq(
(1, 1, 2, 3, 8, 4, 5, 1, 3, 2, 0, 4, 2, 8, 1, 1, 2, 3, 8, 4, 5),
(2, 4, 3, 8, 7, 9, 8, 2, 3, 3, 2, 6, 5, 4, 2, 4, 3, 8, 7, 9, 8),
(3, 6, 1, 9, 2, 3, 6, 3, 8, 5, 1, 2, 3, 5, 3, 6, 1, 9, 2, 3, 6),
(4, 7, 8, 6, 9, 4, 5, 4, 9, 8, 2, 4, 9, 2, 4, 7, 8, 6, 9, 4, 5),
(5, 9, 2, 7, 8, 7, 3, 5, 3, 4, 8, 0, 6, 2, 5, 9, 2, 7, 8, 7, 3),
(6, 1, 1, 4, 2, 8, 4, 6, 3, 9, 8, 8, 9, 3, 6, 1, 1, 4, 2, 8, 4)
)).toDF("uid", "col1", "col2", "col3", "col4", "col5",
"col6", "col7", "col8", "col9", "colA", "colB",
"colC", "colD", "colE", "colF", "colG", "colH",
"colI", "colJ", "colK")
// Create multiple column lists using the sliding method
val Array(colList1, colList2, colList3, colList4) = exampleDF.columns.filter(_ != "uid").sliding(5,5).toArray
// Create a vector assembler for each column list
val colList1_assembler = new VectorAssembler().setInputCols(colList1).setOutputCol("colList1_vec")
val colList2_assembler = new VectorAssembler().setInputCols(colList2).setOutputCol("colList2_vec")
val colList3_assembler = new VectorAssembler().setInputCols(colList3).setOutputCol("colList3_vec")
val colList4_assembler = new VectorAssembler().setInputCols(colList4).setOutputCol("colList4_vec")
// Create a vector assembler using column list vectors as input
val features_assembler = new VectorAssembler().setInputCols(Array("colList1_vec","colList2_vec","colList3_vec","colList4_vec")).setOutputCol("features")
// Create the pipeline with column list vector assemblers first, then the final vector of vectors assembler last
val pipeline = new Pipeline().setStages(Array(colList1_assembler,colList2_assembler,colList3_assembler,colList4_assembler,features_assembler))
// Fit and transform the data
val featuresDF = pipeline.fit(exampleDF).transform(exampleDF)
// Get the number of features in "features" vector
val featureLength = (featuresDF.schema(featuresDF.schema.fieldIndex("features")).metadata.getMetadata("ml_attr").getLong("num_attrs"))
// Print number of features in "features vector"
print(featureLength)
(NOTE: The method of creating the column lists should really be done programmatically, but I've kept this example simple for the sake of understanding the concept.)
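A rough sketch of doing that grouping programmatically (the group size, column names and variable names here are arbitrary choices) could look like this:
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.VectorAssembler
// Split the feature columns into chunks of at most 5 and build one assembler per chunk
val groupSize = 5
val featureCols = exampleDF.columns.filter(_ != "uid")
val colGroups = featureCols.sliding(groupSize, groupSize).toArray
val subVecCols = colGroups.indices.map(i => s"subVec_$i").toArray
val subAssemblers: Array[PipelineStage] = colGroups.zip(subVecCols).map { case (cols, outCol) =>
  new VectorAssembler().setInputCols(cols).setOutputCol(outCol)
}
// Final assembler combines all sub-vectors into a single "features" column
val finalAssembler = new VectorAssembler().setInputCols(subVecCols).setOutputCol("features")
val autoPipeline = new Pipeline().setStages(subAssemblers :+ finalAssembler)
val autoFeaturesDF = autoPipeline.fit(exampleDF).transform(exampleDF)
// Optionally drop the original source columns once the "features" vector exists
val slimDF = autoFeaturesDF.drop(featureCols:_*)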