Spark DataFrame: concatenate column names onto values

Dar*_*ark 2 dataframe apache-spark

I have a DataFrame that I want to modify so that each row includes the column names. For example:

FirstName LastName
Jhon       Doe
David      Lue

should produce the following:

(FirstName=Jhon,LastName=Doe)
(FirstName=David,LastName=Lue)

I managed to do this for a DataFrame with 2 columns:

val x = df.map { row => (names(0) + "=" + row(0), names(1) + "=" + row(1)) }

But how can I do this in a loop for an arbitrary number of columns?

Thanks

Dan*_*ula 9

One option is to use foldLeft over the column names:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.DataFrame

import sqlContext.implicits._
val df = Seq(
  ("John", "Doe"),
  ("David", "Lue")
).toDF("first_name", "last_name")

// Fold over the column names, replacing each column with "<name>=<value>".
val x = df.columns.foldLeft(df) {
  (acc: DataFrame, colName: String) =>
    acc.withColumn(colName, concat(lit(colName + "="), col(colName)))
}

x.show()

This results in:

+----------------+-------------+
|      first_name|    last_name|
+----------------+-------------+
| first_name=John|last_name=Doe|
|first_name=David|last_name=Lue|
+----------------+-------------+
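The same result can probably also be built in a single select instead of a chain of withColumn calls, and the labelled columns can be joined into one string shaped like the output the question asks for. The following is only a sketch, not part of the original answer: it assumes Spark 2.x or later with a local SparkSession, and the names label, labelled, joined and labelled_row are made up for illustration.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, concat, concat_ws, lit}

// Assumption: a local session for experimenting; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("label-columns").getOrCreate()
import spark.implicits._

val df = Seq(("John", "Doe"), ("David", "Lue")).toDF("first_name", "last_name")

// Hypothetical helper: builds the expression "<column name>=<column value>".
def label(c: String): Column = concat(lit(c + "="), col(c))

// Same shape as the foldLeft result, built in one projection for any number of columns.
val labelled = df.select(df.columns.map(c => label(c).as(c)): _*)

// All labelled columns joined into one string per row, wrapped in parentheses.
val joined = df.select(
  concat(lit("("), concat_ws(",", df.columns.map(label): _*), lit(")")).as("labelled_row")
)

labelled.show(false)
joined.show(false)
```

Keep in mind that concat returns null when any of its inputs is null, while concat_ws skips null entries, so rows with missing values may need a coalesce first.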

If you want to convert this to an RDD of tuples, you can call map on it:

x.rdd.map(r => (r.getString(0), r.getString(1)))

Or even use Spark SQL's typed API:

x.as[(String, String)].rdd
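
Both conversions hard-code two columns in the tuple type. For an arbitrary number of columns, one option is to pair each value of a row with its column name, which generalizes the two-column map from the question. The helper below is a sketch in plain Scala so that it can be tried without Spark; labelRow is a hypothetical name, not a Spark API, and the sample data stands in for df.columns and row.toSeq.

```scala
// Hypothetical helper: pairs each column name with the value at the same position
// and renders the row as "(name1=value1,name2=value2,...)".
def labelRow(names: Seq[String], values: Seq[Any]): String =
  names.zip(values)
    .map { case (name, value) => s"$name=$value" }
    .mkString("(", ",", ")")

// Stand-ins for df.columns and for the values of each Row (row.toSeq).
val names = Seq("FirstName", "LastName")
val rows = Seq(Seq[Any]("Jhon", "Doe"), Seq[Any]("David", "Lue"))

rows.map(values => labelRow(names, values)).foreach(println)
```

On the original DataFrame this would presumably be applied by first taking val names = df.columns.toSeq and then calling df.rdd.map(row => labelRow(names, row.toSeq)), which gives an RDD[String] regardless of how many columns there are.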