Tags: apache-spark, apache-spark-sql, apache-spark-dataset
I'm using:
dataset.withColumn("lead",lead(dataset.col(start_date),1).over(orderBy(start_date)));
I just want to add grouping by trackId, so that lead is computed within each group, the way any aggregate function works:
+---------+------------+----------+----------+
| trackId | start_time | end_time | lead     |
+---------+------------+----------+----------+
| 1       | 12:00:00   | 12:04:00 | 12:05:00 |
| 1       | 12:05:00   | 12:08:00 | 12:20:00 |
| 1       | 12:20:00   | 12:22:00 | null     |
| 2       | 13:00:00   | 13:04:00 | 13:05:00 |
| 2       | 13:05:00   | 13:08:00 | 13:20:00 |
| 2       | 13:20:00   | 13:22:00 | null     |
+---------+------------+----------+----------+
Any help?
All you're missing is the Window object and a partitionBy call:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

dataset.withColumn("lead", lead(col("start_time"), 1).over(Window.partitionBy("trackId").orderBy("start_time")))
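For completeness, here is a minimal, self-contained sketch that reproduces the expected output end to end. The sample data, the string-typed time columns, and the local SparkSession setup are illustrative assumptions, not part of the original question:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object LeadPerGroup {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("lead-per-group")
      .getOrCreate()
    import spark.implicits._

    // Illustrative sample data matching the table in the question;
    // times are kept as strings for simplicity (lexicographic order
    // matches chronological order for zero-padded HH:mm:ss values).
    val dataset = Seq(
      (1, "12:00:00", "12:04:00"),
      (1, "12:05:00", "12:08:00"),
      (1, "12:20:00", "12:22:00"),
      (2, "13:00:00", "13:04:00"),
      (2, "13:05:00", "13:08:00"),
      (2, "13:20:00", "13:22:00")
    ).toDF("trackId", "start_time", "end_time")

    // Partitioning by trackId makes lead() restart within each group,
    // so the last row of every partition gets null.
    val w = Window.partitionBy("trackId").orderBy("start_time")

    dataset.withColumn("lead", lead(col("start_time"), 1).over(w)).show()

    spark.stop()
  }
}

Because the window is partitioned, lead never crosses trackId boundaries, which is exactly why the last row of each group in the expected table shows null.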