Jon*_*han 7 group-by aggregate spark-dataframe
我有一个与此类似的问题,但由collect_list操作的列数由名称列表给出。例如:
scala> w.show
+---+-----+----+-----+
|iid|event|date|place|
+---+-----+----+-----+
| A| D1| T0| P1|
| A| D0| T1| P2|
| B| Y1| T0| P3|
| B| Y2| T2| P3|
| C| H1| T0| P5|
| C| H0| T9| P5|
| B| Y0| T1| P2|
| B| H1| T3| P6|
| D| H1| T2| P4|
+---+-----+----+-----+
scala> val combList = List("event", "date", "place")
combList: List[String] = List(event, date, place)
scala> val v = w.groupBy("iid").agg(collect_list(combList(0)), collect_list(combList(1)), collect_list(combList(2)))
v: org.apache.spark.sql.DataFrame = [iid: string, collect_list(event): array<string> ... 2 more fields]
scala> v.show
+---+-------------------+------------------+-------------------+
|iid|collect_list(event)|collect_list(date)|collect_list(place)|
+---+-------------------+------------------+-------------------+
| B| [Y1, Y2, Y0, H1]| [T0, T2, T1, T3]| [P3, P3, P2, P6]|
| D| [H1]| [T2]| [P4]|
| C| [H1, H0]| [T0, T9]| [P5, P5]|
| A| [D1, D0]| [T0, T1]| [P1, P2]|
+---+-------------------+------------------+-------------------+
Run Code Online (Sandbox Code Playgroud)
有什么方法可以将collect_list应用于agg中的多个列,而无需事先知道combList中的元素数?
小智 2
您可以使用collect_list(struct(col1, col2)) AS元素。
例子:
df.select("cd_issuer", "cd_doc", "cd_item", "nm_item").printSchema
val outputDf = spark.sql(s"SELECT cd_issuer, cd_doc, collect_list(struct(cd_item, nm_item)) AS item FROM teste GROUP BY cd_issuer, cd_doc")
outputDf.printSchema
df
|-- cd_issuer: string (nullable = true)
|-- cd_doc: string (nullable = true)
|-- cd_item: string (nullable = true)
|-- nm_item: string (nullable = true)
outputDf
|-- cd_issuer: string (nullable = true)
|-- cd_doc: string (nullable = true)
|-- item: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- cd_item: string (nullable = true)
| | |-- nm_item: string (nullable = true)
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
4184 次 |
| 最近记录: |