How to get the week of the year in Spark 3.0+?

Tags: scala, apache-spark, apache-spark-sql

I am trying to build a calendar file with columns such as day, month, and so on. The code below works fine, but I cannot find a clean way to extract the week of the year (1-52). In Spark 3.0+, the following line no longer works: .withColumn("week_of_year", date_format(col("day_id"), "W"))

I know I could register a view/table and then run a SQL query against it to extract week_of_year, but is there a better way to do this? (See the sketch after the code below for the view-based approach I mean.)

df.withColumn("day_id", to_date(col("day_id"), date_fmt))
.withColumn("week_day", date_format(col("day_id"), "EEEE"))
.withColumn("month_of_year", date_format(col("day_id"), "M"))
.withColumn("year", date_format(col("day_id"), "y"))
.withColumn("day_of_month", date_format(col("day_id"), "d"))
.withColumn("quarter_of_year", date_format(col("day_id"), "Q"))
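
For reference, a minimal sketch of the view-plus-SQL workaround mentioned above (the view name calendar is just an example; it assumes day_id has already been parsed with to_date):

    // Register the calendar DataFrame as a temporary view, then let
    // Spark SQL's EXTRACT compute the ISO week number.
    df.createOrReplaceTempView("calendar")
    val withWeek = spark.sql(
      "SELECT *, extract(week FROM day_id) AS week_of_year FROM calendar")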

Answer (SCo*_*uto):

Spark 3+ no longer supports these week-based patterns:

Caused by: java.lang.IllegalArgumentException: All week-based patterns are unsupported since Spark 3.0, detected: w, Please use the SQL function EXTRACT instead
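
As the error message suggests, one option is the SQL EXTRACT function, which can also be applied to a DataFrame column through expr. A minimal sketch (assuming the date column is named date):

    import org.apache.spark.sql.functions.expr

    // extract(week FROM ...) returns the ISO week-of-year in Spark 3+
    df.withColumn("week_of_year", expr("extract(week FROM date)"))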

Or, more simply, you can use the built-in weekofyear function:

 import org.apache.spark.sql.functions._

df.withColumn("week_of_year", weekofyear($"date"))

Test

Input

    import spark.implicits._   // needed for toDF and the $ column syntax

    val df = List("2021-05-15", "1985-10-05")
      .toDF("date")
      .withColumn("date", to_date($"date", "yyyy-MM-dd"))

df.show
    +----------+
    |      date|
    +----------+
    |2021-05-15|
    |1985-10-05|
    +----------+

Output

 df.withColumn("week_of_year", weekofyear($"date")).show
+----------+------------+
|      date|week_of_year|
+----------+------------+
|2021-05-15|          19|
|1985-10-05|          40|
+----------+------------+
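
Note that weekofyear (and extract(week FROM ...)) follows ISO 8601 semantics: weeks start on Monday and week 1 is the first week of the year with more than three days, so the value can range from 1 to 53 rather than 1 to 52.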