pyspark collect_set or collect_list with groupby

Han*_*art 40 group-by list set collect pyspark

How can I use collect_set or collect_list on a DataFrame after a groupby? For example: df.groupby('key').collect_set('values'). I get an error: AttributeError: 'GroupedData' object has no attribute 'collect_set'

ksi*_*ndi 65

You need to use agg. Example:

from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F

sc = SparkContext("local")

sqlContext = HiveContext(sc)

df = sqlContext.createDataFrame([
    ("a", None, None),
    ("a", "code1", None),
    ("a", "code2", "name2"),
], ["id", "code", "name"])

df.show()

+---+-----+-----+
| id| code| name|
+---+-----+-----+
|  a| null| null|
|  a|code1| null|
|  a|code2|name2|
+---+-----+-----+

Note that above you have to create a HiveContext. See /sf/answers/2487036541/ for dealing with different Spark versions.

(df
  .groupby("id")
  .agg(F.collect_set("code"),
       F.collect_list("name"))
  .show())

+---+-----------------+------------------+
| id|collect_set(code)|collect_list(name)|
+---+-----------------+------------------+
|  a|   [code1, code2]|           [name2]|
+---+-----------------+------------------+

  • collect_set() keeps distinct elements, while collect_list() keeps all elements (except nulls) (6 upvotes)
  • When I have several columns in the list, how can I get the output of collect_list as dicts? E.g. agg(collect_list(struct(df.f1,df.f2,df.f3))). The output for each group should be [f1:value, f2:value, f3:value]. (3 upvotes)