I'm trying to run a simple SQL query on some DataFrames in spark-shell. The query adds an interval of one week to a given date, as shown below.
The original query:
scala> spark.sql("select Cast(table1.date2 as Date) + interval 1 week from table1").show()
Now, I did some tests:
scala> spark.sql("select Cast('1999-09-19' as Date) + interval 1 week from table1").show()
and I got the correct result:
+----------------------------------------------------------------------------+
|CAST(CAST(CAST(1999-09-19 AS DATE) AS TIMESTAMP) + interval 1 weeks AS DATE)|
+----------------------------------------------------------------------------+
| 1999-09-26|
+----------------------------------------------------------------------------+
(just adding 7 days to 19 gives 26)
But when I changed the year to 1997 instead of 1999, the result changed!
scala> spark.sql("select Cast('1997-09-19' as Date) + interval 1 week from table1").show()
+----------------------------------------------------------------------------+
|CAST(CAST(CAST(1997-09-19 AS DATE) AS TIMESTAMP) + interval 1 weeks AS DATE)|
+----------------------------------------------------------------------------+
| 1997-09-25|
+----------------------------------------------------------------------------+
Why did the result change? Shouldn't it be 26, not 25?
So, is this a bug in Spark SQL related to some kind of loss in the calculation, or am I missing something?
This is most likely an issue with local time zone conversion. INTERVAL casts the data to TIMESTAMP and then back to DATE:
scala> spark.sql("SELECT CAST('1997-09-19' AS DATE) + INTERVAL 1 weeks").explain
== Physical Plan ==
*Project [10130 AS CAST(CAST(CAST(1997-09-19 AS DATE) AS TIMESTAMP) + interval 1 weeks AS DATE)#19]
+- Scan OneRowRelation[]
(note the second and third CASTs), and Spark is known to be finicky when handling timestamps.
DATE_ADD should exhibit more stable behavior:
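If the shift really does come from a daylight-saving transition in the local time zone during that week of 1997, pinning the session time zone should make the off-by-one disappear. A minimal check, assuming Spark 2.2+ where the spark.sql.session.timeZone option is available:

scala> java.util.TimeZone.getDefault.getID   // the time zone Spark falls back to
scala> spark.conf.set("spark.sql.session.timeZone", "UTC")
scala> spark.sql("SELECT CAST('1997-09-19' AS DATE) + INTERVAL 1 week").show()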
scala> spark.sql("SELECT DATE_ADD(CAST('1997-09-19' AS DATE), 7)").explain
== Physical Plan ==
*Project [10130 AS date_add(CAST(1997-09-19 AS DATE), 7)#27]
+- Scan OneRowRelation[]
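Applied to the original query, a DATE_ADD rewrite would look something like this (reusing the table1 and date2 names from the question):

scala> spark.sql("SELECT DATE_ADD(CAST(table1.date2 AS DATE), 7) FROM table1").show()

Since DATE_ADD works purely in days, it avoids the round trip through TIMESTAMP, which is consistent with the plan above showing a single CAST and no TIMESTAMP conversion.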