Spark SQL nested withColumn

Jon*_*Jon 15 scala dataframe apache-spark udf

I have a DataFrame with several columns, some of which are structs. Something like this:

root
 |-- foo: struct (nullable = true)
 |    |-- bar: string (nullable = true)
 |    |-- baz: string (nullable = true)
 |-- abc: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- def: struct (nullable = true)
 |    |    |    |-- a: string (nullable = true)
 |    |    |    |-- b: integer (nullable = true)
 |    |    |    |-- c: string (nullable = true)

I want to apply a UserDefinedFunction on the column baz to replace baz with a function of baz, but I cannot figure out how to do that. Here is an example of the desired output (note that baz is now an int):

root
 |-- foo: struct (nullable = true)
 |    |-- bar: string (nullable = true)
 |    |-- baz: int (nullable = true)
 |-- abc: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- def: struct (nullable = true)
 |    |    |    |-- a: string (nullable = true)
 |    |    |    |-- b: integer (nullable = true)
 |    |    |    |-- c: string (nullable = true)

It looks like DataFrame.withColumn only works on top-level columns, not on nested ones. I'm using Scala for this.

Can someone help me with this?

Thanks

Rap*_*oth 19

That's easy, just use a dot to select the nested structure, e.g. $"foo.baz":

case class Foo(bar:String,baz:String)
case class Record(foo:Foo)

val df = Seq(
   Record(Foo("Hi","There"))
).toDF()


df.printSchema

root
 |-- foo: struct (nullable = true)
 |    |-- bar: string (nullable = true)
 |    |-- baz: string (nullable = true)


val myUDF = udf((s:String) => {
 // do something with s 
  s.toUpperCase
})


df
.withColumn("udfResult",myUDF($"foo.baz"))
.show

+----------+---------+
|       foo|udfResult|
+----------+---------+
|[Hi,There]|    THERE|
+----------+---------+

If you want to add the result of the UDF to the existing struct foo, i.e. to get:

root
 |-- foo: struct (nullable = false)
 |    |-- bar: string (nullable = true)
 |    |-- baz: string (nullable = true)
 |    |-- udfResult: string (nullable = true)

there are two options:

With withColumn:

df
.withColumn("udfResult",myUDF($"foo.baz"))
.withColumn("foo",struct($"foo.*",$"udfResult"))
.drop($"udfResult")

With select:

df
.select(struct($"foo.*",myUDF($"foo.baz").as("udfResult")).as("foo"))
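
Note that with the select variant only foo survives as a top-level column; any other top-level columns have to be listed explicitly. For the schema in the question this would look roughly like the following sketch (assuming the abc column should just be carried along unchanged):

df
.select(
  struct($"foo.*", myUDF($"foo.baz").as("udfResult")).as("foo"),
  $"abc"
)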

Edit: to replace an existing attribute inside the struct with the result of the UDF: unfortunately, the following does not work:

df
.withColumn("foo.baz",myUDF($"foo.baz")) 

But it can be done like this:

// get all columns except foo.baz
val structCols = df.select($"foo.*")
    .columns
    .filter(_!="baz")
    .map(name => col("foo."+name))

df.withColumn(
    "foo",
    struct((structCols:+myUDF($"foo.baz").as("baz")):_*)
)
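
The same pattern can be wrapped in a small helper that rebuilds a struct column with one field replaced. A minimal sketch (the helper name withNestedColumn and its parameters are made up for illustration; like the snippet above, it moves the replaced field to the end of the struct):

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, struct}

// Rebuild the struct column `structCol`, replacing the field `fieldName`
// with `newValue` and keeping every other field of the struct unchanged.
def withNestedColumn(df: DataFrame, structCol: String, fieldName: String, newValue: Column): DataFrame = {
  // all fields of the struct except the one being replaced
  val otherFields = df.select(col(structCol + ".*"))
    .columns
    .filter(_ != fieldName)
    .map(name => col(structCol + "." + name))

  df.withColumn(structCol, struct((otherFields :+ newValue.as(fieldName)): _*))
}

// usage: replace foo.baz with the UDF result
val result = withNestedColumn(df, "foo", "baz", myUDF($"foo.baz"))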

  • @Gaurav Shah import org.apache.spark.sql.functions._ (2 upvotes)
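
For reference, the snippets above roughly assume the following setup (a sketch; outside the spark-shell you need a SparkSession plus the implicits import to get $ and .toDF()):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct, udf}

val spark = SparkSession.builder()
  .appName("nested-withColumn-example")
  .master("local[*]")
  .getOrCreate()

// enables the $"..." syntax and .toDF() on a Seq of case classes
import spark.implicits._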