Adding dates and intervals in Spark SQL

yak*_*out 4 sql apache-spark apache-spark-sql

I am trying to run a simple SQL query on some DataFrames in spark-shell. The query adds a one-week interval to a date, as follows:

Original query:

scala> spark.sql("select Cast(table1.date2 as Date) + interval 1 week from table1").show()

Then I ran a few tests:

scala> spark.sql("select Cast('1999-09-19' as Date) + interval 1 week from table1").show()

I got the correct result:

+----------------------------------------------------------------------------+
|CAST(CAST(CAST(1999-09-19 AS DATE) AS TIMESTAMP) + interval 1 weeks AS DATE)|
+----------------------------------------------------------------------------+
|                                                                  1999-09-26|
+----------------------------------------------------------------------------+

(simply adding 7 days to the 19th gives the 26th)

But when I changed the year from 1999 to 1997, the result changed!

scala> spark.sql("select Cast('1997-09-19' as Date) + interval 1 week from table1").show()

+----------------------------------------------------------------------------+
|CAST(CAST(CAST(1997-09-19 AS DATE) AS TIMESTAMP) + interval 1 weeks AS DATE)|
+----------------------------------------------------------------------------+
|                                                                  1997-09-25|
+----------------------------------------------------------------------------+

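Why did the result change? Shouldn't it be the 26th rather than the 25th?

So is this a bug in Spark SQL caused by some kind of loss in the calculation, or am I missing something?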

hi-*_*zir 7

This is probably a local time conversion problem. Adding an INTERVAL casts the data to TIMESTAMP and then back to DATE:

scala> spark.sql("SELECT CAST('1997-09-19' AS DATE) + INTERVAL 1 weeks").explain
== Physical Plan ==
*Project [10130 AS CAST(CAST(CAST(1997-09-19 AS DATE) AS TIMESTAMP) + interval 1 weeks AS DATE)#19]
+- Scan OneRowRelation[]

(note the second and third CASTs), and Spark is known to be inconsistent in how it handles timestamps.
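A minimal sketch of how a DATE -> TIMESTAMP -> DATE round trip can lose a day, using plain `java.time` rather than Spark. It assumes the interval is applied as a fixed 7 x 24 hours on the instant timeline, which may not match every Spark version. The zone `America/New_York` and the date `1997-10-20` are picked only because a daylight-saving change (26 October 1997) falls inside that week; they are not taken from the question, and the object and method names are hypothetical.

```scala
import java.time.{Duration, LocalDate, ZoneId}

object IntervalVsDateAdd {
  // Emulates CAST(date AS TIMESTAMP) + fixed-length week, then CAST(... AS DATE).
  // Adding a Duration to a ZonedDateTime works on the instant timeline,
  // so a clock change inside the week shifts the local wall-clock time.
  def addWeekViaTimestamp(date: LocalDate, zone: ZoneId): LocalDate =
    date.atStartOfDay(zone).plus(Duration.ofDays(7)).toLocalDate

  // Emulates DATE_ADD(date, 7): pure calendar arithmetic, no time zone involved.
  def addWeekViaCalendar(date: LocalDate): LocalDate =
    date.plusDays(7)

  def main(args: Array[String]): Unit = {
    val zone  = ZoneId.of("America/New_York") // assumption, for illustration only
    val start = LocalDate.of(1997, 10, 20)    // clocks go back one hour on 1997-10-26

    // Midnight + 168 hours lands at 23:00 on the 26th, so the date is one day short.
    println(addWeekViaTimestamp(start, zone)) // 1997-10-26
    println(addWeekViaCalendar(start))        // 1997-10-27
  }
}
```

If the session time zone in the question had a clock change between 19 and 26 September 1997 but not in 1999, the same effect would explain the 25 versus 26 difference. That is a guess; the question does not state the time zone.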

DATE_ADD should behave more predictably:

scala> spark.sql("SELECT DATE_ADD(CAST('1997-09-19' AS DATE), 7)").explain
== Physical Plan ==
*Project [10130 AS date_add(CAST(1997-09-19 AS DATE), 7)#27]
+- Scan OneRowRelation[]

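  • Inconsistent: if you have a cluster that spans two time zones, timestamp-to-date conversion breaks down completely (unless you use the methods that take an explicit time zone every time). (3 upvotes)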
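To illustrate the comment above, here is a small sketch in plain `java.time` (not Spark) showing that one and the same instant maps to different calendar dates depending on the zone used for the conversion. The instant and the two zones are arbitrary examples, and `DateDependsOnZone` and `dateIn` are hypothetical names.

```scala
import java.time.{Instant, LocalDate, ZoneId}

object DateDependsOnZone {
  // Converts an absolute instant to a calendar date in an explicitly given zone.
  def dateIn(instant: Instant, zone: ZoneId): LocalDate =
    instant.atZone(zone).toLocalDate

  def main(args: Array[String]): Unit = {
    val instant = Instant.parse("1997-09-19T23:30:00Z")

    // Same instant, two machines with different default zones, two dates.
    println(dateIn(instant, ZoneId.of("UTC")))        // 1997-09-19
    println(dateIn(instant, ZoneId.of("Asia/Tokyo"))) // 1997-09-20
  }
}
```

Passing the zone explicitly, as `dateIn` does, keeps the result independent of whichever default zone a given machine happens to have.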