Accessing the field names of a struct in Spark SQL


I'm trying to "lift" the fields of a struct up to the top level of a DataFrame, as in this example:

case class A(a1: String, a2: String)
case class B(b1: String, b2: A)

val df = Seq(B("X",A("Y","Z"))).toDF

df.show    
+---+-----+
| b1|   b2|
+---+-----+
|  X|[Y,Z]|
+---+-----+

df.printSchema
root
 |-- b1: string (nullable = true)
 |-- b2: struct (nullable = true)
 |    |-- a1: string (nullable = true)
 |    |-- a2: string (nullable = true)

val lifted = df.withColumn("a1", $"b2.a1").withColumn("a2", $"b2.a2").drop("b2")

lifted.show
+---+---+---+
| b1| a1| a2|
+---+---+---+
|  X|  Y|  Z|
+---+---+---+

lifted.printSchema
root
 |-- b1: string (nullable = true)
 |-- a1: string (nullable = true)
 |-- a2: string (nullable = true)

This works. I'd like to write a small utility method that does this for me, perhaps by pimping DataFrame so I can write something like df.lift("b2").

To do that, I think I need a way to get a list of all the fields in a struct. For example, given "b2" as input it would return ["a1", "a2"]. How can I do this?
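For illustration, the rough shape I have in mind is below; the implicit class is just a sketch of my own, with the piece I'm missing marked:

import org.apache.spark.sql.DataFrame

implicit class LiftOps(df: DataFrame) {
  def lift(structCol: String): DataFrame = {
    // the part I don't know how to do: list the struct's field names
    val fields: Seq[String] = ???
    fields
      .foldLeft(df)((acc, f) => acc.withColumn(f, acc(s"$structCol.$f")))
      .drop(structCol)
  }
}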

eli*_*sah:

If I understand your question correctly, you want to be able to list the nested fields of column b2.

So you need to filter the schema down to b2, access its StructType, and then map the fields (StructFields) to their names:

import org.apache.spark.sql.types.StructType

val nested_fields = df.schema
  .filter(c => c.name == "b2")
  .flatMap(_.dataType.asInstanceOf[StructType].fields)
  .map(_.name)

// nested_fields: Seq[String] = List(a1, a2)
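
As a side note, StructType can also be indexed by field name, so the same list can be obtained a little more directly (this assumes a column named b2 exists; schema(...) throws an IllegalArgumentException otherwise):

val nested_fields = df.schema("b2")
  .dataType.asInstanceOf[StructType]
  .fieldNames
  .toList

// nested_fields: List[String] = List(a1, a2)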
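Combining that with the withColumn approach from your question, a minimal sketch of the df.lift("b2") extension you describe could look like the following (StructLifter and lift are names I made up, not a standard API):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

implicit class StructLifter(df: DataFrame) {
  // Promote every field of the struct column `name` to the top level,
  // then drop the original struct column.
  def lift(name: String): DataFrame = {
    val fields = df.schema(name).dataType.asInstanceOf[StructType].fieldNames
    fields
      .foldLeft(df)((acc, f) => acc.withColumn(f, col(s"$name.$f")))
      .drop(name)
  }
}

df.lift("b2").show
// +---+---+---+
// | b1| a1| a2|
// +---+---+---+
// |  X|  Y|  Z|
// +---+---+---+

For a one-off, star expansion also works without listing the fields yourself: df.select("b1", "b2.*") yields the same three columns.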