Jon*_*Jon 15 scala dataframe apache-spark udf
我有一个DataFrame,它有多个列,其中一些是结构.像这样的东西
root
|-- foo: struct (nullable = true)
| |-- bar: string (nullable = true)
| |-- baz: string (nullable = true)
|-- abc: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- def: struct (nullable = true)
| | | |-- a: string (nullable = true)
| | | |-- b: integer (nullable = true)
| | | |-- c: string (nullable = true)
Run Code Online (Sandbox Code Playgroud)
我想UserDefinedFunction
在列上应用一个baz
替换baz
功能baz
,但我无法弄清楚如何做到这一点.这是一个所需输出的例子(注意baz
现在是一个int
)
root
|-- foo: struct (nullable = true)
| |-- bar: string (nullable = true)
| |-- baz: int (nullable = true)
|-- abc: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- def: struct (nullable = true)
| | | |-- a: string (nullable = true)
| | | |-- b: integer (nullable = true)
| | | |-- c: string (nullable = true)
Run Code Online (Sandbox Code Playgroud)
它看起来DataFrame.withColumn
只适用于顶级列,但不适用于嵌套列.我正在使用Scala来解决这个问题.
有人可以帮我解决这个问题吗?
谢谢
Rap*_*oth 19
这很简单,只需使用一个点来选择嵌套结构,例如$"foo.baz"
:
case class Foo(bar:String,baz:String)
case class Record(foo:Foo)
val df = Seq(
Record(Foo("Hi","There"))
).toDF()
df.printSchema
root
|-- foo: struct (nullable = true)
| |-- bar: string (nullable = true)
| |-- baz: string (nullable = true)
val myUDF = udf((s:String) => {
// do something with s
s.toUpperCase
})
df
.withColumn("udfResult",myUDF($"foo.baz"))
.show
+----------+---------+
| foo|udfResult|
+----------+---------+
|[Hi,There]| THERE|
+----------+---------+
Run Code Online (Sandbox Code Playgroud)
如果要将UDF的结果添加到现有结构中foo
,即获取:
root
|-- foo: struct (nullable = false)
| |-- bar: string (nullable = true)
| |-- baz: string (nullable = true)
| |-- udfResult: string (nullable = true)
Run Code Online (Sandbox Code Playgroud)
有两种选择:
用withColumn
:
df
.withColumn("udfResult",myUDF($"foo.baz"))
.withColumn("foo",struct($"foo.*",$"udfResult"))
.drop($"udfResult")
Run Code Online (Sandbox Code Playgroud)
用select
:
df
.select(struct($"foo.*",myUDF($"foo.baz").as("udfResult")).as("foo"))
Run Code Online (Sandbox Code Playgroud)
编辑:更换与来自UDF的结果结构中的现有属性:不幸的是,这并没有工作:
df
.withColumn("foo.baz",myUDF($"foo.baz"))
Run Code Online (Sandbox Code Playgroud)
但可以这样做:
// get all columns except foo.baz
val structCols = df.select($"foo.*")
.columns
.filter(_!="baz")
.map(name => col("foo."+name))
df.withColumn(
"foo",
struct((structCols:+myUDF($"foo.baz").as("baz")):_*)
)
Run Code Online (Sandbox Code Playgroud)
归档时间: |
|
查看次数: |
5824 次 |
最近记录: |