How to use lag and lead in Spark with grouping and ordering

san*_*dev 6 apache-spark apache-spark-sql apache-spark-dataset

I am using:

dataset.withColumn("lead",lead(dataset.col(start_date),1).over(orderBy(start_date)));

I just want to add grouping by trackId, so that lead works within each group, like any agg function:

+---------+------------+----------+----------+
| trackId | start_time | end_time | lead     |
+---------+------------+----------+----------+
| 1       | 12:00:00   | 12:04:00 | 12:05:00 |
| 1       | 12:05:00   | 12:08:00 | 12:20:00 |
| 1       | 12:20:00   | 12:22:00 | null     |
| 2       | 13:00:00   | 13:04:00 | 13:05:00 |
| 2       | 13:05:00   | 13:08:00 | 13:20:00 |
| 2       | 13:20:00   | 13:22:00 | null     |
+---------+------------+----------+----------+

Any help?

Ram*_*jan 6

All you are missing is the Window object and a partitionBy call on it:

import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
dataset.withColumn("lead",lead(col("start_time"),1).over(Window.partitionBy("trackId").orderBy("start_time")))
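To make the semantics concrete, here is a minimal pure-Scala sketch (no Spark required) of what `lead("start_time", 1).over(Window.partitionBy("trackId").orderBy("start_time"))` computes: within each `trackId` partition, rows are ordered by `start_time` and each row is paired with the next row's `start_time`, or null (`None` here) for the last row. The `Track` case class and `withLead` helper are illustrative names, not Spark API.

```scala
// Illustrative sketch of lead-over-partition semantics; not Spark code.
case class Track(trackId: Int, startTime: String, endTime: String)

// For each trackId partition: sort by startTime, then pair every row with
// the following row's startTime (None for the last row in the partition).
def withLead(rows: Seq[Track]): Map[Int, Seq[(Track, Option[String])]] =
  rows.groupBy(_.trackId).map { case (id, group) =>
    val sorted = group.sortBy(_.startTime)
    val leads  = sorted.drop(1).map(r => Option(r.startTime)) :+ None
    id -> sorted.zip(leads)
  }
```

Note that without the `partitionBy`, the window would span the whole dataset, so the last row of track 1 would get track 2's first `start_time` as its lead instead of null.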