Geo*_*ler 3 struct complextype higher-order-functions apache-spark apache-spark-sql
如何transform使用 spark 高阶函数将结构数组再次转换为结构?
数据集:
case class Foo(thing1:String, thing2:String, thing3:String)
case class Baz(foo:Foo, other:String)
case class Bar(id:Int, bazes:Seq[Baz])
import spark.implicits._
val df = Seq(Bar(1, Seq(Baz(Foo("first", "second", "third"), "other"), Baz(Foo("1", "2", "3"), "else")))).toDF
df.printSchema
df.show(false)
Run Code Online (Sandbox Code Playgroud)
我想连接所有thing1, thign2, thing3但保留other每个bar.
一个简单的:
scala> df.withColumn("cleaned", expr("transform(bazes, x -> x)")).printSchema
root
|-- id: integer (nullable = false)
|-- bazes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- foo: struct (nullable = true)
| | | |-- thing1: string (nullable = true)
| | | |-- thing2: string (nullable = true)
| | | |-- thing3: string (nullable = true)
| | |-- other: string (nullable = true)
|-- cleaned: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- foo: struct (nullable = true)
| | | |-- thing1: string (nullable = true)
| | | |-- thing2: string (nullable = true)
| | | |-- thing3: string (nullable = true)
| | |-- other: string (nullable = true)
Run Code Online (Sandbox Code Playgroud)
只会把东西复制过来。
所需的连接操作:
df.withColumn("cleaned", expr("transform(bazes, x -> concat(x.foo.thing1, '::', x.foo.thing2, '::', x.foo.thing3))")).printSchema
Run Code Online (Sandbox Code Playgroud)
不幸的是,将从other列中删除所有值:
+---+----------------------------------------------------+-------------------------------+
|id |bazes |cleaned |
+---+----------------------------------------------------+-------------------------------+
|1 |[[[first, second, third], other], [[1, 2, 3], else]]|[first::second::third, 1::2::3]|
+---+----------------------------------------------------+-------------------------------+
Run Code Online (Sandbox Code Playgroud)
如何保留这些?试图保留元组:
df.withColumn("cleaned", expr("transform(bazes, x -> (concat(x.foo.thing1, '::', x.foo.thing2, '::', x.foo.thing3), x.other))")).printSchema
Run Code Online (Sandbox Code Playgroud)
失败:
.AnalysisException: cannot resolve 'named_struct('col1', concat(namedlambdavariable().`foo`.`thing1`, '::', namedlambdavariable().`foo`.`thing2`, '::', namedlambdavariable().`foo`.`thing3`), NamePlaceholder(), namedlambdavariable().`other`)' due to data type mismatch: Only foldable string expressions are allowed to appear at odd position, got: NamePlaceholder; line 1 pos 22;
Run Code Online (Sandbox Code Playgroud)
所需的输出:
一个包含内容的新列:
[[第一::第二::第三,其他],[1::2::3,其他]
保留列 other
小智 6
通过这种方式,您可以实现您想要的输出。您不能直接访问其他值 bcoz foo 和其他共享相同的层次结构。所以你需要单独访问其他。
scala> df.withColumn("cleaned", expr("transform(bazes, x -> struct(concat(x.foo.thing1, '::', x.foo.thing2, '::', x.foo.thing3),cast(x.other as string)))")).show(false)
+---+----------------------------------------------------+------------------------------------------------+
|id |bazes |cleaned |
+---+----------------------------------------------------+------------------------------------------------+
Run Code Online (Sandbox Code Playgroud)
打印模式
scala> df.withColumn("cleaned", expr("transform(bazes, x -> struct(concat(x.foo.thing1, '::', x.foo.thing2, '::', x.foo.thing3),cast(x.other as string)))")).printSchema
root
|-- id: integer (nullable = false)
|-- bazes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- foo: struct (nullable = true)
| | | |-- thing1: string (nullable = true)
| | | |-- thing2: string (nullable = true)
| | | |-- thing3: string (nullable = true)
| | |-- other: string (nullable = true)
|-- cleaned: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- col1: string (nullable = true)
| | |-- col2: string (nullable = true)
Run Code Online (Sandbox Code Playgroud)
如果您还有任何与此相关的问题,请告诉我。