Spark Scala: How to convert a DataFrame[vector] to DataFrame[f1: Double, ..., fn: Double]

mt8*_*t88 4 scala apache-spark apache-spark-sql apache-spark-ml

I'm using StandardScaler to normalize the features for my ML application. After selecting the scaled features, I want to convert them back to a DataFrame of Doubles, but my vectors are of arbitrary length. I know how to do it for a specific set of 3 features with

myDF.map{case Row(v: Vector) => (v(0), v(1), v(2))}.toDF("f1", "f2", "f3")

but not for an arbitrary number of features. Is there an easy way to do this?

Example:

val testDF = sc.parallelize(List(Vectors.dense(5D, 6D, 7D), Vectors.dense(8D, 9D, 10D), Vectors.dense(11D, 12D, 13D))).map(Tuple1(_)).toDF("scaledFeatures")
val myColumnNames = List("f1", "f2", "f3")
// val finalDF = DataFrame[f1: Double, f2: Double, f3: Double] 

EDIT

I figured out how to unpack a list of column names when creating the DataFrame, but I'm still having trouble converting a vector into the sequence needed to create the DataFrame:

finalDF = testDF.map{case Row(v: Vector) => v.toArray.toSeq /* <= this errors */}.toDF(List("f1", "f2", "f3"): _*)
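
The toDF call fails because it needs a Product type (a tuple or case class) rather than a bare Seq[Double], and tuple arity cannot be chosen at runtime. One workaround is to build Rows and pass an explicit schema to createDataFrame. A minimal sketch, assuming a Spark 2.x spark session (use sqlContext on 1.x) and the myColumnNames/testDF values from the example above:

import org.apache.spark.ml.linalg.Vector // org.apache.spark.mllib.linalg.Vector on 1.x
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// One DoubleType field per desired column name
val schema = StructType(myColumnNames.map(StructField(_, DoubleType, nullable = false)))

// Going through the RDD sidesteps the Encoder that Dataset.map would require
val rows = testDF.rdd.map { case Row(v: Vector) => Row.fromSeq(v.toArray) }

val finalDF = spark.createDataFrame(rows, schema)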

zer*_*323 16

One possible approach is something similar to this:

import org.apache.spark.sql.functions.udf

// In Spark 1.x you'll have to replace the ML Vector with the MLlib one
// import org.apache.spark.mllib.linalg.Vector
// In 2.x the below is usually the right choice
import org.apache.spark.ml.linalg.Vector

// Get size of the vector
val n = testDF.first.getAs[Vector](0).size

// Simple helper to convert vector to array<double> 
// asNondeterministic is available in Spark 2.3 or later
// It can be removed, but at the cost of decreased performance
val vecToSeq = udf((v: Vector) => v.toArray).asNondeterministic

// Prepare a list of columns to create
val exprs = (0 until n).map(i => $"_tmp".getItem(i).alias(s"f$i"))

testDF.select(vecToSeq($"scaledFeatures").alias("_tmp")).select(exprs:_*)
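
Applied to the testDF from the question, this should yield a DataFrame with columns f0 through f2; calling .show() on the result would print something along these lines:

// +----+----+----+
// |  f0|  f1|  f2|
// +----+----+----+
// | 5.0| 6.0| 7.0|
// | 8.0| 9.0|10.0|
// |11.0|12.0|13.0|
// +----+----+----+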

This can be simplified a bit if you know the list of columns up front:

val cols: Seq[String] = ???
val exprs = cols.zipWithIndex.map{ case (c, i) => $"_tmp".getItem(i).alias(c) }
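
For example, with the column names from the question (reusing vecToSeq and testDF from above):

val cols = Seq("f1", "f2", "f3")
val exprs = cols.zipWithIndex.map { case (c, i) => $"_tmp".getItem(i).alias(c) }

val finalDF = testDF
  .select(vecToSeq($"scaledFeatures").alias("_tmp"))
  .select(exprs: _*)
// finalDF: org.apache.spark.sql.DataFrame = [f1: double, f2: double, f3: double]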

For a Python equivalent, see How to split Vector into columns - using PySpark.
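
As an aside beyond the original answer: on Spark 3.0 or later, the built-in vector_to_array function can replace the hand-rolled UDF entirely. A sketch, assuming Spark 3.x:

import org.apache.spark.ml.functions.vector_to_array

// Same column-extraction expressions as above
val n = testDF.first.getAs[Vector](0).size
val exprs = (0 until n).map(i => $"_tmp".getItem(i).alias(s"f$i"))

testDF
  .select(vector_to_array($"scaledFeatures").alias("_tmp"))
  .select(exprs: _*)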