san*_*dev 6 apache-spark apache-spark-sql apache-spark-dataset
I am using:
dataset.withColumn("lead", lead(dataset.col(start_date), 1).over(orderBy(start_date)));
All I want is to add grouping by trackId, so that lead is computed within each group, the way any aggregate function would be:
+---------+------------+----------+----------+
| trackId | start_time | end_time | lead     |
+---------+------------+----------+----------+
| 1       | 12:00:00   | 12:04:00 | 12:05:00 |
| 1       | 12:05:00   | 12:08:00 | 12:20:00 |
| 1       | 12:20:00   | 12:22:00 | null     |
| 2       | 13:00:00   | 13:04:00 | 13:05:00 |
| 2       | 13:05:00   | 13:08:00 | 13:20:00 |
| 2       | 13:20:00   | 13:22:00 | null     |
+---------+------------+----------+----------+
Any help?
All you are missing is the Window object and a partitionBy call:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

dataset.withColumn("lead", lead(col("start_time"), 1).over(Window.partitionBy("trackId").orderBy("start_time")))
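For completeness, here is a minimal self-contained sketch of the same fix. The SparkSession setup and the sample rows are assumptions added for illustration, mirroring the table in the question:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("lead-demo").getOrCreate()
import spark.implicits._

// Sample data matching the question's table (assumed for illustration)
val dataset = Seq(
  (1, "12:00:00", "12:04:00"),
  (1, "12:05:00", "12:08:00"),
  (1, "12:20:00", "12:22:00"),
  (2, "13:00:00", "13:04:00"),
  (2, "13:05:00", "13:08:00"),
  (2, "13:20:00", "13:22:00")
).toDF("trackId", "start_time", "end_time")

// partitionBy restarts the window at each trackId, so lead never
// reads past a group boundary; the last row of each group gets null
val w = Window.partitionBy("trackId").orderBy("start_time")
dataset.withColumn("lead", lead(col("start_time"), 1).over(w)).show()

Because the window is partitioned, each trackId is processed independently, which is exactly the per-group behavior the question asks for.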