I have a PySpark DataFrame like this:
Id timestamp col1 col2
abc 789 0 1
def 456 1 0
abc 123 1 0
def 321 0 1
I want to group or partition by the Id column and then build lists of col1 and col2 values ordered by timestamp:
Id timestamp col1 col2
abc [123,789] [1,0] [0,1]
def [321,456] [0,1] [1,0]
My approach (the column names in the collect_list calls referred to non-existent columns, and a trailing backslash broke the statement; fixed to match the DataFrame above):

from pyspark.sql import functions as F
from pyspark.sql import Window as W

window_spec = W.partitionBy("Id").orderBy("timestamp")
ranged_spec = window_spec.rowsBetween(W.unboundedPreceding, W.unboundedFollowing)

df1 = df.withColumn("col1", F.collect_list("col1").over(window_spec)) \
        .withColumn("col2", F.collect_list("col2").over(window_spec))
df1.show()
But this does not return the full lists for col1 and col2: an ordered window defaults to the frame (unboundedPreceding, currentRow), so each row only gets a running prefix of the list, and the result still has one row per input row instead of one row per Id.