Jon*_*Jon 15 scala dataframe apache-spark udf
我有一个DataFrame,它有多个列,其中一些是结构.像这样的东西
root
|-- foo: struct (nullable = true)
| |-- bar: string (nullable = true)
| |-- baz: string (nullable = true)
|-- abc: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- def: struct (nullable = true)
| | | |-- a: string (nullable = true)
| | | |-- b: integer (nullable = true)
| | | |-- c: string (nullable = true)
Run Code Online (Sandbox Code Playgroud)
我想UserDefinedFunction在列上应用一个baz替换baz功能baz,但我无法弄清楚如何做到这一点.这是一个所需输出的例子(注意baz现在是一个int)
root
|-- foo: struct (nullable = true)
| |-- bar: string (nullable = true)
| |-- baz: int (nullable = true)
|-- abc: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- def: struct (nullable = true)
| | | |-- a: string (nullable = true)
| | | |-- b: integer (nullable = true)
| | | |-- c: string (nullable = true)
Run Code Online (Sandbox Code Playgroud)
它看起来DataFrame.withColumn只适用于顶级列,但不适用于嵌套列.我正在使用Scala来解决这个问题.
有人可以帮我解决这个问题吗?
谢谢
Rap*_*oth 19
这很简单,只需使用一个点来选择嵌套结构,例如$"foo.baz":
case class Foo(bar:String,baz:String)
case class Record(foo:Foo)
val df = Seq(
Record(Foo("Hi","There"))
).toDF()
df.printSchema
root
|-- foo: struct (nullable = true)
| |-- bar: string (nullable = true)
| |-- baz: string (nullable = true)
val myUDF = udf((s:String) => {
// do something with s
s.toUpperCase
})
df
.withColumn("udfResult",myUDF($"foo.baz"))
.show
+----------+---------+
| foo|udfResult|
+----------+---------+
|[Hi,There]| THERE|
+----------+---------+
Run Code Online (Sandbox Code Playgroud)
如果要将UDF的结果添加到现有结构中foo,即获取:
root
|-- foo: struct (nullable = false)
| |-- bar: string (nullable = true)
| |-- baz: string (nullable = true)
| |-- udfResult: string (nullable = true)
Run Code Online (Sandbox Code Playgroud)
有两种选择:
用withColumn:
df
.withColumn("udfResult",myUDF($"foo.baz"))
.withColumn("foo",struct($"foo.*",$"udfResult"))
.drop($"udfResult")
Run Code Online (Sandbox Code Playgroud)
用select:
df
.select(struct($"foo.*",myUDF($"foo.baz").as("udfResult")).as("foo"))
Run Code Online (Sandbox Code Playgroud)
编辑:更换与来自UDF的结果结构中的现有属性:不幸的是,这并没有工作:
df
.withColumn("foo.baz",myUDF($"foo.baz"))
Run Code Online (Sandbox Code Playgroud)
但可以这样做:
// get all columns except foo.baz
val structCols = df.select($"foo.*")
.columns
.filter(_!="baz")
.map(name => col("foo."+name))
df.withColumn(
"foo",
struct((structCols:+myUDF($"foo.baz").as("baz")):_*)
)
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
5824 次 |
| 最近记录: |