luc*_*ucy 2 sql hive scala cumulative-sum apache-spark
I want to compute a cumulative sum in Spark. Here is the registered table (input):
+---------------+-------------------+----+----+----+
| product_id| date_time| ack|val1|val2|
+---------------+-------------------+----+----+----+
|4008607333T.upf|2017-12-13:02:27:01|3-46| 53| 52|
|4008607333T.upf|2017-12-13:02:27:03|3-47| 53| 52|
|4008607333T.upf|2017-12-13:02:27:08|3-46| 53| 52|
|4008607333T.upf|2017-12-13:02:28:01|3-47| 53| 52|
|4008607333T.upf|2017-12-13:02:28:07|3-46| 15| 1|
+---------------+-------------------+----+----+----+
Hive query:
select *,
  SUM(val1) over (
    partition by product_id, ack
    order by date_time
    rows between unbounded preceding and current row
  ) val1_sum,
  SUM(val2) over (
    partition by product_id, ack
    order by date_time
    rows between unbounded preceding and current row
  ) val2_sum
from test
Output:
+---------------+-------------------+----+----+----+-------+--------+
| product_id| date_time| ack|val1|val2|val_sum|val2_sum|
+---------------+-------------------+----+----+----+-------+--------+
|4008607333T.upf|2017-12-13:02:27:01|3-46| 53| 52| 53| 52|
|4008607333T.upf|2017-12-13:02:27:08|3-46| 53| 52| 106| 104|
|4008607333T.upf|2017-12-13:02:28:07|3-46| 15| 1| 121| 105|
|4008607333T.upf|2017-12-13:02:27:03|3-47| 53| 52| 53| 52|
|4008607333T.upf|2017-12-13:02:28:01|3-47| 53| 52| 106| 104|
+---------------+-------------------+----+----+----+-------+--------+
With the following Spark logic I get the same result as the output above:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy('product_id, 'ack).orderBy('date_time)
val newDf = inputDF
  .withColumn("val_sum", sum('val1) over w)
  .withColumn("val2_sum", sum('val2) over w)
newDf.show
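However, when I run this logic on a Spark cluster, the val_sum value is sometimes half of the expected cumulative sum and sometimes some other value. I don't know why this happens on the cluster. Is it caused by partitioning?
How can I compute a cumulative sum over a column on a Spark cluster?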
To get a cumulative sum with the DataFrame API, you should use the rowsBetween window method. In Spark 2.1 and later, create the window as follows:
val w = Window.partitionBy($"product_id", $"ack")
.orderBy($"date_time")
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
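This tells Spark to use the values from the start of the partition up to and including the current row. With older versions of Spark, rowsBetween(Long.MinValue, 0) achieves the same effect.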
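To make the frame semantics concrete, here is a minimal sketch in plain Scala (no Spark involved) that emulates a "rows between unbounded preceding and current row" sum on a local collection. The Reading case class, the RowsFrameSketch object, and the cumulative helper are hypothetical names introduced only for this illustration; the sample rows are the ones from the question.

```scala
// Hypothetical row type mirroring the columns of the input table.
final case class Reading(productId: String, dateTime: String, ack: String, val1: Int, val2: Int)

object RowsFrameSketch {
  // Sample rows copied from the question's input table.
  val sample: Seq[Reading] = Seq(
    Reading("4008607333T.upf", "2017-12-13:02:27:01", "3-46", 53, 52),
    Reading("4008607333T.upf", "2017-12-13:02:27:03", "3-47", 53, 52),
    Reading("4008607333T.upf", "2017-12-13:02:27:08", "3-46", 53, 52),
    Reading("4008607333T.upf", "2017-12-13:02:28:01", "3-47", 53, 52),
    Reading("4008607333T.upf", "2017-12-13:02:28:07", "3-46", 15, 1)
  )

  // Local emulation of:
  //   partition by product_id, ack
  //   order by date_time
  //   rows between unbounded preceding and current row
  def cumulative(rows: Seq[Reading]): Seq[(Reading, Int, Int)] =
    rows
      .groupBy(r => (r.productId, r.ack))   // one group per window partition
      .toSeq
      .sortBy(_._1)                         // deterministic partition order for printing
      .flatMap { case (_, group) =>
        val ordered = group.sortBy(_.dateTime)              // order by date_time
        // scanLeft yields the running totals; drop the initial 0 with tail.
        val sums1 = ordered.scanLeft(0)(_ + _.val1).tail
        val sums2 = ordered.scanLeft(0)(_ + _.val2).tail
        ordered.zip(sums1).zip(sums2).map { case ((r, s1), s2) => (r, s1, s2) }
      }

  def main(args: Array[String]): Unit =
    cumulative(sample).foreach { case (r, s1, s2) =>
      println(s"${r.ack} ${r.dateTime} val_sum=$s1 val2_sum=$s2")
    }
}
```

For the sample rows, the running totals agree with the output table in the question: 53, 106, 121 (val1) and 52, 104, 105 (val2) for ack 3-46, and 53, 106 and 52, 104 for ack 3-47.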
To use the window, apply it the same way as before:
val newDf = inputDF.withColumn("val_sum", sum($"val1").over(w))
.withColumn("val2_sum", sum($"val2").over(w))
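Putting the pieces together, the following is a self-contained sketch that can be run locally, assuming Spark 2.1 or later is on the classpath. The object name CumulativeSumExample, the helper withCumulativeSums, and the local[*] master are assumptions made for illustration; the window definition and the column names come from the snippets above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}

object CumulativeSumExample {
  // Adds val_sum and val2_sum as running totals per (product_id, ack),
  // ordered by date_time, using an explicit row-based frame.
  def withCumulativeSums(inputDF: DataFrame): DataFrame = {
    val w = Window
      .partitionBy(col("product_id"), col("ack"))
      .orderBy(col("date_time"))
      .rowsBetween(Window.unboundedPreceding, Window.currentRow)

    inputDF
      .withColumn("val_sum", sum(col("val1")).over(w))
      .withColumn("val2_sum", sum(col("val2")).over(w))
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only; on a cluster the master
    // would normally be supplied by spark-submit.
    val spark = SparkSession.builder()
      .appName("cumulative-sum-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Sample rows copied from the question's input table.
    val inputDF = Seq(
      ("4008607333T.upf", "2017-12-13:02:27:01", "3-46", 53, 52),
      ("4008607333T.upf", "2017-12-13:02:27:03", "3-47", 53, 52),
      ("4008607333T.upf", "2017-12-13:02:27:08", "3-46", 53, 52),
      ("4008607333T.upf", "2017-12-13:02:28:01", "3-47", 53, 52),
      ("4008607333T.upf", "2017-12-13:02:28:07", "3-46", 15, 1)
    ).toDF("product_id", "date_time", "ack", "val1", "val2")

    withCumulativeSums(inputDF)
      .orderBy(col("product_id"), col("ack"), col("date_time"))
      .show(false)

    spark.stop()
  }
}
```

The final orderBy only makes the printed rows easier to compare with the expected output; the window itself determines the sums, regardless of how the rows are distributed across the cluster.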