Let's take the following toy problem. I have the following case classes:
case class Order(id: String, name: String, status: String)
case class TruncatedOrder(id: String)
case class Org(name: String, ord: Seq[TruncatedOrder])
I now have the following variables defined:
val ordersDF = Seq(Order("or1", "stuff", "shipped"), Order("or2", "thigns", "delivered") , Order("or3", "thingamabobs", "never received"), Order("or4", "???", "what?")).toDS()
val orgsDF = Seq(Org("tupper", Seq(TruncatedOrder("or1"), TruncatedOrder("or2"), TruncatedOrder("or3"))), Org("ware", Seq(TruncatedOrder("or3"), TruncatedOrder("or4")))).toDS()
What I want is, for example, a data point like the following:
Ord("tupper", Array(Joined("or1", "stuff", "shipped"), Joined("or2", "things", "delivered"), ...))
I would like to know how to write my join statement and my filter statement.
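To pin down the desired result independently of Spark, here is a minimal sketch of the same join on plain Scala collections. `Joined`, `OrgWithOrders`, `ExpectedShape`, and `joinOrders` are hypothetical names introduced only for illustration; only `Joined` is hinted at by the desired output above.

```scala
case class Order(id: String, name: String, status: String)
case class TruncatedOrder(id: String)
case class Org(name: String, ord: Seq[TruncatedOrder])

// Hypothetical result types, not part of the original question
case class Joined(id: String, name: String, status: String)
case class OrgWithOrders(name: String, orders: Seq[Joined])

object ExpectedShape {
  // Resolve each truncated order against the full orders.
  // Ids without a matching order are dropped; an org with no
  // matches is kept with an empty list (a Spark inner join
  // followed by groupBy would drop such an org entirely).
  def joinOrders(orgs: Seq[Org], orders: Seq[Order]): Seq[OrgWithOrders] = {
    val byId = orders.map(o => o.id -> o).toMap
    orgs.map { org =>
      val joined = org.ord
        .flatMap(t => byId.get(t.id))
        .map(o => Joined(o.id, o.name, o.status))
      OrgWithOrders(org.name, joined)
    }
  }
}
```

Calling `ExpectedShape.joinOrders` with the sample orgs and orders as plain `Seq`s gives one `OrgWithOrders` per organisation, which is the shape the Spark query is expected to reproduce.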
Here is how I got the data into the format I wanted. This answer is heavily inspired by the answers provided by @ulrich and @Mariusz.
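import org.apache.spark.sql.functions.{collect_set, explode, udf}
import spark.implicits._ // assumes a SparkSession named `spark` (as in spark-shell)

// Pack the three order fields into a single array column
val ud = udf((col: String, name: String, status: String) => Seq(col, name, status))
orgsDF
  // One row per (org, order id); explode names its output column "col"
  .select($"name".as("ordName"), explode($"ord.id"))
  .join(ordersDF, $"col" === $"id").drop($"id")
  // Alias the UDF result so it can be referenced in the aggregation
  .select($"ordName", ud($"col", $"name", $"status").as("order"))
  .groupBy($"ordName")
  .agg(collect_set($"order").as("orders"))
  .show(false) // false: do not truncate the wide column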
+-------+--------------------------------------------------------------------------------------------------------------------------+
|ordName|orders |
+-------+--------------------------------------------------------------------------------------------------------------------------+
|ware |[WrappedArray(or4, ???, what?), WrappedArray(or3, thingamabobs, never received)] |
|tupper |[WrappedArray(or1, stuff, shipped), WrappedArray(or2, thigns, delivered), WrappedArray(or3, thingamabobs, never received)]|
+-------+--------------------------------------------------------------------------------------------------------------------------+
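If a typed result is preferred over arrays of strings, one possible variant replaces the UDF with the built-in `struct` function and converts the result with `.as[...]`. This is only a sketch under assumptions: `Joined`, `OrgWithOrders`, `StructJoinSketch`, and `joinOrders` are hypothetical names, and `collect_list` is used here instead of `collect_set`, so duplicates would be kept.

```scala
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.{collect_list, explode, struct}

case class Order(id: String, name: String, status: String)
case class TruncatedOrder(id: String)
case class Org(name: String, ord: Seq[TruncatedOrder])

// Hypothetical result types, not part of the original question
case class Joined(id: String, name: String, status: String)
case class OrgWithOrders(name: String, orders: Seq[Joined])

object StructJoinSketch {
  def joinOrders(orgs: Dataset[Org], orders: Dataset[Order]): Dataset[OrgWithOrders] = {
    val spark = orgs.sparkSession
    import spark.implicits._

    orgs
      // One row per (org, order id)
      .select($"name".as("orgName"), explode($"ord.id").as("orderId"))
      // Inner join: order ids with no matching order are dropped
      .join(orders, $"orderId" === $"id")
      .groupBy($"orgName")
      // Struct field names match the fields of Joined
      .agg(collect_list(struct($"id", $"name", $"status")).as("orders"))
      .select($"orgName".as("name"), $"orders")
      .as[OrgWithOrders]
  }
}
```

With the data from the question, `StructJoinSketch.joinOrders(orgsDF, ordersDF)` should yield one `OrgWithOrders` per organisation; the order of elements inside `orders` is not guaranteed after the shuffle.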