sat*_*ish 1 hive scala apache-spark apache-spark-sql
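I need to create a table (a Hive table or a Spark DataFrame) from a source table, such that a user's data, which is spread across multiple rows, is collected into a list in a single row.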
User table:
Schema: userid: string | transactiondate:string | charges: string |events:array<struct<name:string,value:string>>
----|------------|-------| ---------------------------------------
123 | 2017-09-01 | 20.00 | [{"name":"chargeperiod","value":"this"}]
123 | 2017-09-01 | 30.00 | [{"name":"chargeperiod","value":"last"}]
123 | 2017-09-01 | 20.00 | [{"name":"chargeperiod","value":"recent"}]
123 | 2017-09-01 | 30.00 | [{"name":"chargeperiod","value":"0"}]
456 | 2017-09-01 | 20.00 | [{"name":"chargeperiod","value":"this"}]
456 | 2017-09-01 | 30.00 | [{"name":"chargeperiod","value":"last"}]
456 | 2017-09-01 | 20.00 | [{"name":"chargeperiod","value":"recent"}]
456 | 2017-09-01 | 30.00 | [{"name":"chargeperiod","value":"0"}]
The output table should be:
userid:String | concatenatedlist :List[Row]
-------|-----------------
123 | [[2017-09-01,20.00,[{"name":"chargeperiod","value":"this"}]],[2017-09-01,30.00,[{"name":"chargeperiod","value":"last"}]],[2017-09-01,20.00,[{"name":"chargeperiod","value":"recent"}]], [2017-09-01,30.00, [{"name":"chargeperiod","value":"0"}]]]
456 | [[2017-09-01,20.00,[{"name":"chargeperiod","value":"this"}]],[2017-09-01,30.00,[{"name":"chargeperiod","value":"last"}]],[2017-09-01,20.00,[{"name":"chargeperiod","value":"recent"}]], [2017-09-01,30.00, [{"name":"chargeperiod","value":"0"}]]]
Spark version: 1.6.2
小智 8
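import org.apache.spark.sql.functions.{collect_list, concat_ws}
import sqlContext.implicits._ // assumes sqlContext is a HiveContext; collect_list needs one on Spark 1.6

// Sample rows; "array" is a plain string standing in for the events column
val df = Seq(("1", "2017-02-01", "20.00", "abc"),
("1", "2017-02-01", "30.00", "abc2"),
("2", "2017-02-01", "20.00", "abc"),
("2", "2017-02-01", "30.00", "abc"))
.toDF("id", "date", "amt", "array")

// Join each row's fields into one string, then gather the strings for each id
df.withColumn("new", concat_ws(",", $"date", $"amt", $"array"))
.select("id", "new")
.groupBy("id")
.agg(concat_ws(",", collect_list("new")))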
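Note that the snippet above produces one comma-joined string per id, not a list of rows. To make the target shape from the question concrete, here is a minimal sketch of the same grouping in plain Scala collections, without Spark. The case classes `Event` and `Txn` and the helper `concatenatePerUser` are hypothetical names introduced only for illustration; on Spark itself something like an RDD `groupByKey`, or on newer versions `collect_list` over a struct column, would probably play this role, but that is an assumption and not something the answer covers.

```scala
// Hypothetical types mirroring the question's schema (names are assumptions).
final case class Event(name: String, value: String)
final case class Txn(userid: String, transactiondate: String,
                     charges: String, events: List[Event])

object GroupPerUser {
  // For each userid, collect (transactiondate, charges, events) of every row
  // into one list. Seq.groupBy keeps the original row order inside each group.
  def concatenatePerUser(
      rows: Seq[Txn]): Map[String, List[(String, String, List[Event])]] =
    rows.groupBy(_.userid).map { case (id, txns) =>
      id -> txns.map(t => (t.transactiondate, t.charges, t.events)).toList
    }

  def main(args: Array[String]): Unit = {
    // A subset of the sample rows from the question.
    val rows = Seq(
      Txn("123", "2017-09-01", "20.00", List(Event("chargeperiod", "this"))),
      Txn("123", "2017-09-01", "30.00", List(Event("chargeperiod", "last"))),
      Txn("456", "2017-09-01", "20.00", List(Event("chargeperiod", "this")))
    )
    // One output line per user, with all of that user's rows in a single list.
    concatenatePerUser(rows).foreach { case (id, list) =>
      println(s"$id | $list")
    }
  }
}
```

The result keeps the nested structure (date, charge, and events stay separate fields), which is what distinguishes it from the string concatenation in the answer.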