使用 PySpark 将天数分组为数周

Soz*_*ron 5 apache-spark apache-spark-sql pyspark-sql databricks

我最近在 BigQuery 中对日期进行分组时遇到了类似的查询DATE_ADD 或 DATE_DIFF 错误,但我想知道如何在 PySpark 中执行此操作,因为我是新手

day         bitcoin_total   dash_total
2009-01-03  1               0
2009-01-09  14              0
2009-01-10  61              0
Run Code Online (Sandbox Code Playgroud)

理想的结果是一周开始的日期(可能是星期一或星期日,以哪个为准)

day         bitcoin_total   dash_total
2008-12-28  1               0
2009-01-04  75              0
Run Code Online (Sandbox Code Playgroud)

下面的代码按数字返回周数,总数似乎不对。我似乎无法复制 .agg(sum()) 返回的总数,我什至无法添加第二个总数 (dash_total)。我试过了.col("dash_total")有没有办法将几天分成几个星期?

from pyspark.sql.functions import weekofyear, sum

(df
    .groupBy(weekofyear("day").alias("date_by_week"))
    .agg(sum("bitcoin_total"))
    .orderBy("date_by_week")
    .show())
Run Code Online (Sandbox Code Playgroud)

我在 Databricks 上运行 Spark。

Shu*_*Shu 6

使用date_sub,next_dayspark 中的函数尝试这种方法。

解释:

date_sub(
        next_day(col("day"),"sunday"), //get next sunday date
   7)) //substract week from the date
Run Code Online (Sandbox Code Playgroud)

例子:

In pyspark:

from pyspark.sql.functions import *
df = sc.parallelize([("2009-01-03","1","0"),("2009-01-09","14","0"),("2009-01-10","61","0")]).toDF(["day","bitcoin_total","dash_total"])
df.withColumn("week_strt_day",date_sub(next_day(col("day"),"sunday"),7)).groupBy("week_strt_day").agg(sum("bitcoin_total").cast("int").alias("bitcoin_total"),sum("dash_total").cast("int").alias("dash_total")).orderBy("week_strt_day").show()
Run Code Online (Sandbox Code Playgroud)

Result:

+-------------+-------------+----------+
|week_strt_day|bitcoin_total|dash_total|
+-------------+-------------+----------+
|   2008-12-28|            1|         0|
|   2009-01-04|           75|         0|
+-------------+-------------+----------+
Run Code Online (Sandbox Code Playgroud)

In scala:

import org.apache.spark.sql.functions._
val df=Seq(("2009-01-03","1","0"),("2009-01-09","14","0"),("2009-01-10","61","0")).toDF("day","bitcoin_total","dash_total") 
df.withColumn("week_strt_day",date_sub(next_day('day,"sunday"),7)).groupBy("week_strt_day").agg(sum("bitcoin_total").cast("int").alias("bitcoin_total"),sum("dash_total").cast("int").alias("dash_total")).orderBy("week_strt_day").show()
Run Code Online (Sandbox Code Playgroud)

Result:

+-------------+-------------+----------+
|week_strt_day|bitcoin_total|dash_total|
+-------------+-------------+----------+
|   2008-12-28|            1|         0|
|   2009-01-04|           75|         0|
+-------------+-------------+----------+
Run Code Online (Sandbox Code Playgroud)