Spark DataFrame: aggregate column values into a List, grouped by key

C.A*_*C.A 7 dataframe apache-spark apache-spark-sql

I have a DataFrame that looks like this:

+-----------------+-------+
|Id               | value |
+-----------------+-------+
|             1622| 139685|
|             1622| 182118|
|             1622| 127955|
|             3837|3224815|
|             1622| 727761|
|             1622| 155875|
|             3837|1504923|
|             1622| 139684|
+-----------------+-------+

I want to turn it into:

    +-----------------+-------------------------------------------+
    |Id               | value                                     |
    +-----------------+-------------------------------------------+
    |             1622|139685,182118,127955,727761,155875,139684  |
    |             3837|3224815,1504923                            |
    +-----------------+-------------------------------------------+

Is this possible using only DataFrame functions, or do I need to convert it to an RDD?

Dav*_*fin 8

It can be done with the DataFrame API. Try:

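import org.apache.spark.sql.functions.{col, collect_list}
// Group rows by Id and gather each group's values into an array column.
df.groupBy(col("Id"))
  .agg(collect_list(col("value")) as "value")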
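For anyone who wants to try this end to end, here is a minimal sketch that rebuilds the sample data from the question and runs the same aggregation. The local `SparkSession` settings, the app name, the object name `CollectListSketch`, and the choice of `Int` for both columns are assumptions made for illustration; the question does not state the actual column types.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list}

object CollectListSketch {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed settings).
    val spark = SparkSession.builder()
      .appName("collect-list-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Sample rows copied from the question; Int columns are an assumption.
    val df = Seq(
      (1622, 139685), (1622, 182118), (1622, 127955), (3837, 3224815),
      (1622, 727761), (1622, 155875), (3837, 1504923), (1622, 139684)
    ).toDF("Id", "value")

    // One output row per Id, with all of that Id's values in an array.
    val grouped = df.groupBy(col("Id"))
      .agg(collect_list(col("value")) as "value")

    grouped.printSchema() // "value" is now an array column
    grouped.show(false)   // false = do not truncate long cells

    spark.stop()
  }
}
```

Note that `collect_list` keeps duplicates, and the order of elements inside each array depends on the order in which rows reach the aggregation, so it is not guaranteed to match the original row order after a shuffle.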

If you want a comma-separated String instead of an Array, try this:

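import org.apache.spark.sql.functions.{col, collect_list, concat_ws}
// Collect each group's values into an array, then join the array with commas.
df.groupBy(col("Id"))
  .agg(collect_list(col("value")) as "value")
  .withColumn("value", concat_ws(",", col("value")))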
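Two details may matter in practice: `concat_ws` works on strings (or arrays of strings), and the element order produced by `collect_list` is not deterministic. The sketch below is one possible variation, not part of the original answer: it sorts each array numerically with `sort_array` and then casts it to `array<string>` explicitly before joining. The session settings, the object name `ConcatSortedSketch`, and the `Int` column types are assumptions. Sorting gives a reproducible result but does not reproduce the row order shown in the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, concat_ws, sort_array}

object ConcatSortedSketch {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed settings).
    val spark = SparkSession.builder()
      .appName("concat-sorted-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Sample rows copied from the question; Int columns are an assumption.
    val df = Seq(
      (1622, 139685), (1622, 182118), (1622, 127955), (3837, 3224815),
      (1622, 727761), (1622, 155875), (3837, 1504923), (1622, 139684)
    ).toDF("Id", "value")

    val joined = df.groupBy(col("Id"))
      // Sort while the elements are still numeric, so the order is numeric
      // rather than lexicographic, then cast the array for concat_ws.
      .agg(sort_array(collect_list(col("value"))).cast("array<string>") as "value")
      .withColumn("value", concat_ws(",", col("value")))

    joined.show(false) // false = do not truncate long cells

    spark.stop()
  }
}
```

If the original row order must be preserved, the data needs an explicit ordering column to sort on, since the DataFrame itself carries no guaranteed row order.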