I'm new to Scala. I'm currently trying to aggregate order data in Spark over a 12-month period that slides forward one month at a time.
Below is a simple example of my data, which I have formatted so that you can test it easily:
import spark.implicits._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._

var sample = Seq(
  ("C1", "01/01/2016", 20), ("C1", "02/01/2016", 5),
  ("C1", "03/01/2016", 2),  ("C1", "04/01/2016", 3),
  ("C1", "05/01/2017", 5),  ("C1", "08/01/2017", 5),
  ("C1", "01/02/2017", 10), ("C1", "01/02/2017", 10),
  ("C1", "01/03/2017", 10)
).toDF("id", "order_date", "orders")

// Parse the dd/MM/yyyy strings into a proper DateType column.
sample = sample.withColumn("order_date",
  to_date(unix_timestamp($"order_date", "dd/MM/yyyy").cast("timestamp")))
sample.show
+---+----------+------+
| id|order_date|orders|
+---+----------+------+
| C1|2016-01-01|    20|
| C1|2016-01-02|     5|
| C1|2016-01-03|     2|
| C1|2016-01-04|     3|
| C1|2017-01-05|     5|
| C1|2017-01-08|     5|
| C1|2017-02-01|    10|
| C1|2017-02-01|    10|
| C1|2017-03-01|    10|
+---+----------+------+
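As an aside, on Spark 2.2 or later the unix_timestamp round-trip is not needed, because to_date accepts a format string directly; this is a minor alternative, assuming that version is available:

// Spark 2.2+ only: parse the string column in one step.
sample = sample.withColumn("order_date", to_date($"order_date", "dd/MM/yyyy"))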
The result I need to produce looks like this:
id  period_start  period_end  rolling
C1  2015-01-01    2016-01-01  30
C1  2016-01-01    2017-01-01  …
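For what it's worth, here is a minimal sketch of one way to produce this, assuming the window is month-granular: the first expected row pairs the period 2015-01-01 to 2016-01-01 with the 30 orders placed throughout January 2016, so period_end appears to include its whole month. The intermediate names month, monthly_orders, and month_index are illustrative, not part of the original question:

import org.apache.spark.sql.expressions.Window

// One row per customer per calendar month, plus a linear month index
// so a range window can span exactly 12 consecutive months.
val monthly = sample
  .withColumn("month", trunc($"order_date", "month"))
  .groupBy($"id", $"month")
  .agg(sum($"orders").as("monthly_orders"))
  .withColumn("month_index", year($"month") * 12 + month($"month"))

// Sum the current month and the 11 months before it.
val w = Window.partitionBy($"id").orderBy($"month_index").rangeBetween(-11, 0)

val rolling = monthly
  .withColumn("rolling", sum($"monthly_orders").over(w))
  .select(
    $"id",
    add_months($"month", -12).as("period_start"),
    $"month".as("period_end"),
    $"rolling")

rolling.orderBy($"id", $"period_end").show

One caveat: months without any orders produce no row, so a sliding period whose final month is empty will simply be absent from the output; if every period must appear, the month range would have to be densified first, for example by joining against a generated sequence of months.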