mr *_* kw (19) — apache-spark, pyspark, spark-dataframe
How do I compute a cumulative sum per group using the DataFrame abstraction in PySpark?
Using the following example dataset:
df = sqlContext.createDataFrame(
    [(1, 2, "a"), (3, 2, "a"), (1, 3, "b"), (2, 2, "a"), (2, 3, "b")],
    ["time", "value", "class"])
+----+-----+-----+
|time|value|class|
+----+-----+-----+
| 1| 2| a|
| 3| 2| a|
| 1| 3| b|
| 2| 2| a|
| 2| 3| b|
+----+-----+-----+
I would like to add a cumulative sum column of value for each class grouping, ordered by the time variable.
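For reference, on Spark 2.x and later the legacy sqlContext entry point is superseded by SparkSession; a minimal equivalent sketch for building the same frame (assuming a standard SparkSession setup):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 2, "a"), (3, 2, "a"), (1, 3, "b"), (2, 2, "a"), (2, 3, "b")],
    ["time", "value", "class"])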
mr *_* kw (47)
This can be done using a window function combined with the Window.unboundedPreceding value as the start of the window's range, as follows:
from pyspark.sql import Window
from pyspark.sql import functions as F

# Frame spanning from the start of each 'class' partition
# up to the current row's 'time' value
windowval = (Window.partitionBy('class').orderBy('time')
             .rangeBetween(Window.unboundedPreceding, 0))
df_w_cumsum = df.withColumn('cum_sum', F.sum('value').over(windowval))
df_w_cumsum.show()
+----+-----+-----+-------+
|time|value|class|cum_sum|
+----+-----+-----+-------+
| 1| 3| b| 3|
| 2| 3| b| 6|
| 1| 2| a| 2|
| 2| 2| a| 4|
| 3| 2| a| 6|
+----+-----+-----+-------+
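Note that show() gives no global row order (rows come back partition by partition, which is why class b appears first above); to display the result in a stable order, sort explicitly:

df_w_cumsum.orderBy('class', 'time').show()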
Sah*_*pal (18)
An update to the previous answer: the correct and precise approach is to use rowsBetween rather than rangeBetween:
from pyspark.sql import Window
from pyspark.sql import functions as F

# rowsBetween frames over physical rows, so the sum advances
# one row at a time even when 'time' values are tied
windowval = (Window.partitionBy('class').orderBy('time')
             .rowsBetween(Window.unboundedPreceding, 0))
df_w_cumsum = df.withColumn('cum_sum', F.sum('value').over(windowval))
df_w_cumsum.show()
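The practical difference appears when two rows share the same time value: rangeBetween frames on the ordering value, so tied rows receive the same running total, while rowsBetween advances one physical row at a time (and is therefore order-dependent across ties). A small sketch illustrating this, using a hypothetical df_ties and assuming a SparkSession named spark:

df_ties = spark.createDataFrame(
    [(1, 2, "a"), (1, 5, "a"), (2, 1, "a")],  # two rows tied at time=1
    ["time", "value", "class"])

w_range = Window.partitionBy('class').orderBy('time').rangeBetween(Window.unboundedPreceding, 0)
w_rows = Window.partitionBy('class').orderBy('time').rowsBetween(Window.unboundedPreceding, 0)

df_ties.select(
    'time', 'value',
    F.sum('value').over(w_range).alias('range_sum'),  # both time=1 rows get 7
    F.sum('value').over(w_rows).alias('rows_sum'),    # 2 then 7 (tie order not guaranteed)
).show()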
小智 (5)
I have tried this approach and it works for me. Here -sys.maxsize plays the role of an unbounded preceding bound; it is the older idiom from before the Window.unboundedPreceding constant was available.
from pyspark.sql import Window
from pyspark.sql import functions as f
import sys

# -sys.maxsize serves as an effectively unbounded preceding bound
cum_sum = df.withColumn('cumsum', f.sum('value').over(
    Window.partitionBy('class').orderBy('time').rowsBetween(-sys.maxsize, 0)))
cum_sum.show()
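For completeness, the same running total can be written in Spark SQL, which some readers may find easier to scan; a minimal sketch, assuming a SparkSession named spark, the question's df, and a hypothetical view name "events":

df.createOrReplaceTempView("events")
spark.sql("""
    SELECT time, value, class,
           SUM(value) OVER (
               PARTITION BY class ORDER BY time
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS cum_sum
    FROM events
""").show()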