Pyspark - Get the cumulative sum of a column with a condition

LaS*_*Sul 4 python apache-spark pyspark

I have a dataframe with a card, a timestamp and an amount, and I need to aggregate the amounts per card (sum and count) over a rolling one-month window.

The data looks like this:

+--------------------+-------------------+------------+
|            card_uid|               date|amount_local|
+--------------------+-------------------+------------+
|card_001H4Mw1Ha0M...|2016-05-04 17:54:30|        8.99|
|card_0026uGZQwZQd...|2016-05-06 12:16:18|       16.19|
|card_0026uGZQwZQd...|2016-07-06 12:17:57|       16.19|
|card_003STfrgB8SZ...|2016-12-04 10:05:21|        58.8|
|card_005gBxyiDc6b...|2016-09-10 18:58:25|       27.95|
|card_005gBxyiDc6b...|2016-11-12 11:18:29|       12.99|

This is what I have so far:

+--------------------+-------------------+------------+----------------+
|            card_uid|               date|amount_local|duration_cum_sum|
+--------------------+-------------------+------------+----------------+
|card_001H4Mw1Ha0M...|2016-05-04 17:54:30|        8.99|            8.99|
|card_0026uGZQwZQd...|2016-05-06 12:16:18|       16.19|           16.19|
|card_0026uGZQwZQd...|2016-07-06 12:17:57|       16.19|           32.38|
|card_003STfrgB8SZ...|2016-12-04 10:05:21|        58.8|            58.8|
|card_005gBxyiDc6b...|2016-09-10 18:58:25|       27.95|           27.95|
|card_005gBxyiDc6b...|2016-11-12 11:18:29|       12.99|           40.94|

The window function is as follows:

from pyspark.sql.functions import lag, sum
from pyspark.sql.window import Window

partition = Window.partitionBy("card_uid").orderBy("date")

previousTransactionDate = data.withColumn("previous_tr_time", lag(data.date).over(partition)) \
                              .select("transaction_id", "card_uid", "date", "previous_tr_time")

df_cum_sum = data.withColumn("duration_cum_sum", sum('amount_local').over(partition))

df_cum_sum.orderBy("card_uid", "date").select("card_uid", "date", "amount_local", "duration_cum_sum").show()

The only two things I still need to add are:

  • Aggregate the same way, but only over transactions from the past month
  • Put zero in the cum_sum instead of the current amount (the current transaction itself should not be counted)

So the desired output looks like this:

+--------------------+-------------------+------------+----------------+
|            card_uid|               date|amount_local|duration_cum_sum|
+--------------------+-------------------+------------+----------------+
|card_001H4Mw1Ha0M...|2016-05-04 17:54:30|        8.99|               0|
|card_0026uGZQwZQd...|2016-05-06 12:16:18|       16.19|               0|
|card_0026uGZQwZQd...|2016-05-12 12:17:57|        4.00|           16.19|
|card_0026uGZQwZQd...|2016-06-06 12:23:51|       16.19|            4.00| => Only 4.00 because the 16.19 was more than one month ago
|card_003STfrgB8SZ...|2016-12-04 10:05:21|        58.8|               0|
|card_005gBxyiDc6b...|2016-09-10 18:58:25|       27.95|               0|
|card_005gBxyiDc6b...|2016-09-12 11:18:29|       12.99|           27.95| => Previous amount 
|card_005gBxyiDc6b...|2016-09-22 14:25:44|       23.99|           40.94| => 27.95 + 12.99

I can't groupBy on card_uid because I need the same number of rows as the original dataframe in order to link it to another table.

Man*_*ngh 7

You need a rolling window over the date, ranging from 30 days back up to the previous day. Since interval expressions are not supported in window range definitions, you can cast the date to a long (seconds) and express the window range in seconds-per-day units.

from pyspark.sql.functions import *
from pyspark.sql.window import Window

days = lambda i: i * 86400  # number of seconds in i days

# Range frame: from 30 days before the current transaction up to 1 day before it,
# so the current row itself is excluded from the sum
partition = Window.partitionBy("card_uid") \
                  .orderBy(col("date").cast("timestamp").cast("long")) \
                  .rangeBetween(days(-30), days(-1))

df_cum_sum = data.withColumn("duration_cum_sum", sum(col('amount_local')).over(partition)) \
                 .fillna(0, subset=['duration_cum_sum'])
df_cum_sum.show()
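
For reference, here is a minimal self-contained sketch of the same approach, using a few of the rows from the question. The SparkSession setup and the inline sample data are assumptions added for illustration, not part of the original answer:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as sum_
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Three transactions for one card, taken from the desired output above
data = spark.createDataFrame(
    [
        ("card_0026uGZQwZQd", "2016-05-06 12:16:18", 16.19),
        ("card_0026uGZQwZQd", "2016-05-12 12:17:57", 4.00),
        ("card_0026uGZQwZQd", "2016-06-06 12:23:51", 16.19),
    ],
    ["card_uid", "date", "amount_local"],
)

days = lambda i: i * 86400  # seconds in i days

# 30-day rolling range per card, ending one day before the current row
partition = (
    Window.partitionBy("card_uid")
    .orderBy(col("date").cast("timestamp").cast("long"))
    .rangeBetween(days(-30), days(-1))
)

df_cum_sum = (
    data.withColumn("duration_cum_sum", sum_(col("amount_local")).over(partition))
        .fillna(0, subset=["duration_cum_sum"])
)

df_cum_sum.orderBy("card_uid", "date").show()
# Expected duration_cum_sum: 0.0, 16.19, 4.0 — the 16.19 from 2016-05-06 falls
# outside the 30-day range of the 2016-06-06 transaction, matching the question.

Because the upper bound of the range is -1 day, the current transaction never contributes to its own sum, which is why the first transaction per card comes out as null and is then filled with 0.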