Asked by Rux*_*ang (score 6) · tags: pipeline, scala, classification, apache-spark
I am setting up a classification model with a Spark ML pipeline on a very wide table. That means I have to generate the column-handling code automatically rather than typing out every column by hand. I am close to a beginner in Scala and Spark, and I got stuck at the VectorAssembler() part when I tried the following:
val featureHeaders = featureHeader.collect.mkString(" ")
// convert the header RDD into a single string
val featureArray = featureHeaders.split(",").toArray
val quote = "\""
// wrap every column name in literal double-quote characters
val featureSIArray = featureArray.map(x => s"$quote$x$quote")
// count the number of columns in the header
val featureHeader_cnt = featureHeaders.split(",").toList.length

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Fit on the whole dataset to include all labels in the index.
val labelIndexer = new StringIndexer()
  .setInputCol("target")
  .setOutputCol("indexedLabel")
val featureAssembler = new VectorAssembler()
  .setInputCols(featureSIArray)
  .setOutputCol("features")
val convpipeline = new Pipeline()
  .setStages(Array(labelIndexer, featureAssembler))
val myFeatureTransfer = convpipeline.fit(df)
Obviously it did not work. I am not sure what I should do to make the whole thing more automatic, or whether the ML pipeline simply cannot take this many columns at the moment (which I doubt).
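One quick way to see why the fit fails is to compare the assembled input names against the DataFrame's actual schema. A small diagnostic sketch, assuming df is the DataFrame being fitted above:

// VectorAssembler resolves its input columns by exact name, so a quoted
// name like "\"col1\"" never matches the unquoted name in df's schema.
val missing = featureSIArray.filterNot(df.columns.contains)
println(s"Input columns not found in the schema: ${missing.mkString(", ")}")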
Answered by 小智 (0 votes):
s"$quote$x$quote"除非列名包含引号,否则不应使用引号 ( )。尝试
val featureAssembler = new VectorAssembler()
  .setInputCols(featureArray)
  .setOutputCol("features")