将数据框转换为列名和值的结构数组

Sri*_*vas 0 scala apache-spark apache-spark-sql

假设我有一个这样的数据框

val customer = Seq(
    ("C1", "Jackie Chan", 50, "Dayton", "M"),
    ("C2", "Harry Smith", 30, "Beavercreek", "M"),
    ("C3", "Ellen Smith", 28, "Beavercreek", "F"),
    ("C4", "John Chan", 26, "Dayton","M")
  ).toDF("cid","name","age","city","sex")
Run Code Online (Sandbox Code Playgroud)

我怎样才能在一列中获得 cid 值并array < struct < column_name, column_value > >在火花中获得其余的值

Oli*_*Oli 5

唯一的困难是数组必须包含相同类型的元素。因此,您需要将所有列转换为字符串,然后再将它们放入数组(age在您的情况下为 int)。这是它的过程:

val cols = customer.columns.tail
val result = customer.select('cid,
    array(cols.map(c => struct(lit(c) as "name", col(c) cast "string" as "value")) : _*) as "array")

result.show(false)

+---+-----------------------------------------------------------+
|cid|array                                                      |
+---+-----------------------------------------------------------+
|C1 |[[name,Jackie Chan], [age,50], [city,Dayton], [sex,M]]     |
|C2 |[[name,Harry Smith], [age,30], [city,Beavercreek], [sex,M]]|
|C3 |[[name,Ellen Smith], [age,28], [city,Beavercreek], [sex,F]]|
|C4 |[[name,John Chan], [age,26], [city,Dayton], [sex,M]]       |
+---+-----------------------------------------------------------+

result.printSchema()

root
 |-- cid: string (nullable = true)
 |-- array: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- name: string (nullable = false)
 |    |    |-- value: string (nullable = true)
Run Code Online (Sandbox Code Playgroud)