Python Spark: cumulative sum by group using DataFrame

mr *_* kw 19 apache-spark pyspark spark-dataframe

How do I compute a cumulative sum per group using the DataFrame abstraction in PySpark?

With the following example dataset:

df = sqlContext.createDataFrame( [(1,2,"a"),(3,2,"a"),(1,3,"b"),(2,2,"a"),(2,3,"b")], 
                                 ["time", "value", "class"] )

+----+-----+-----+
|time|value|class|
+----+-----+-----+
|   1|    2|    a|
|   3|    2|    a|
|   1|    3|    b|
|   2|    2|    a|
|   2|    3|    b|
+----+-----+-----+

I would like to add a cumulative sum column of value for each class grouping, ordered over the time variable.
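To make the intent concrete, here is a rough pandas equivalent (purely illustrative; toPandas() is only used to show the desired per-group running total, not as a solution):

pdf = df.toPandas().sort_values(['class', 'time'])
pdf['cum_sum'] = pdf.groupby('class')['value'].cumsum()
# expected cum_sum: class a -> 2, 4, 6; class b -> 3, 6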

mr *_* kw 47

This can be done using a combination of a window function and the Window.unboundedPreceding value in the window's range, as follows:

from pyspark.sql import Window
from pyspark.sql import functions as F

windowval = (Window.partitionBy('class').orderBy('time')
             .rangeBetween(Window.unboundedPreceding, 0))
df_w_cumsum = df.withColumn('cum_sum', F.sum('value').over(windowval))
df_w_cumsum.show()
+----+-----+-----+-------+
|time|value|class|cum_sum|
+----+-----+-----+-------+
|   1|    3|    b|      3|
|   2|    3|    b|      6|
|   1|    2|    a|      2|
|   2|    2|    a|      4|
|   3|    2|    a|      6|
+----+-----+-----+-------+

  • This answer is incorrect. "rowsBetween" is the correct one to use, not "rangeBetween". If the "orderBy" column has duplicate values within a partition, the sums will be wrong. (7 upvotes)
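To illustrate that comment, here is a minimal sketch with a duplicated time value (the extra (2, 5, "a") row is added only for demonstration): with rangeBetween, the two rows sharing time=2 both receive the same running total, while rowsBetween advances one row at a time.

from pyspark.sql import Window
from pyspark.sql import functions as F

# Same schema as the question, plus a duplicate time value in class a
dup = sqlContext.createDataFrame(
    [(1, 2, "a"), (2, 2, "a"), (2, 5, "a"), (3, 2, "a")],
    ["time", "value", "class"])

range_win = (Window.partitionBy('class').orderBy('time')
             .rangeBetween(Window.unboundedPreceding, 0))
rows_win = (Window.partitionBy('class').orderBy('time')
            .rowsBetween(Window.unboundedPreceding, 0))

(dup.withColumn('range_sum', F.sum('value').over(range_win))
    .withColumn('rows_sum', F.sum('value').over(rows_win))
    .show())
# range_sum is 9 for both time=2 rows (2+2+5), whereas rows_sum grows one
# row at a time (e.g. 2, 4, 9, 11); the order of the tied rows is not guaranteed.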

Sah*_*pal 18

An update to the previous answer. The correct and exact way to do this is:

from pyspark.sql import Window
from pyspark.sql import functions as F

windowval = (Window.partitionBy('class').orderBy('time')
             .rowsBetween(Window.unboundedPreceding, 0))
df_w_cumsum = df.withColumn('cum_sum', F.sum('value').over(windowval))
df_w_cumsum.show()

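For completeness, the same rows-based frame can also be expressed in Spark SQL window syntax (a sketch assuming Spark 2.x; the view name df_table is arbitrary):

df.createOrReplaceTempView("df_table")
sqlContext.sql("""
    SELECT time, value, class,
           SUM(value) OVER (
               PARTITION BY class
               ORDER BY time
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS cum_sum
    FROM df_table
""").show()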

小智 5

I have tried this approach and it worked for me.

from pyspark.sql import Window
from pyspark.sql import functions as f
import sys

# -sys.maxsize as the lower frame bound effectively covers all preceding rows,
# which in practice matches Window.unboundedPreceding used above.
cum_sum = df.withColumn(
    'cumsum',
    f.sum('value').over(
        Window.partitionBy('class').orderBy('time')
              .rowsBetween(-sys.maxsize, 0)))
cum_sum.show()