How to convert a Unix timestamp to a given time zone with Spark

alb*_*lcs 5 timezone apache-spark apache-spark-sql

I need help because I seem to be lost in time zones :)

I use Spark 1.6.2.

I have epochs like this:

+--------------+-------------------+-------------------+
|unix_timestamp|UTC                |Europe/Helsinki    |
+--------------+-------------------+-------------------+
|1491771599    |2017-04-09 20:59:59|2017-04-09 23:59:59|
|1491771600    |2017-04-09 21:00:00|2017-04-10 00:00:00|
|1491771601    |2017-04-09 21:00:01|2017-04-10 00:00:01|
+--------------+-------------------+-------------------+
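
As a sanity check on the table above, the two wall-clock columns can be reproduced outside Spark with `java.time`. This is only a sketch; the object name `EpochInZones` and the helper `render` are made up for illustration and are not part of the original code.

```scala
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

object EpochInZones {
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // Render an epoch (seconds since 1970-01-01T00:00:00Z) as wall-clock time in the given zone.
  def render(epochSeconds: Long, zone: String): String =
    Instant.ofEpochSecond(epochSeconds).atZone(ZoneId.of(zone)).format(fmt)

  def main(args: Array[String]): Unit =
    for (ts <- Seq(1491771599L, 1491771600L, 1491771601L))
      println(s"$ts  ${render(ts, "UTC")}  ${render(ts, "Europe/Helsinki")}")
}
```

The epoch itself carries no time zone; only the rendering step depends on one, which is why the same value lands on different dates in the two columns.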

The default time zone on the Spark machine is as follows:

#timezone = DefaultTz: Europe/Prague, SparkUtilTz: Europe/Prague

This is the output of the logger.info call shown further below.

I want to count the timestamps grouped by date and hour in a given time zone (currently Europe/Helsinki, +3 hours).

What I expect:

+----------+---------+-----+
|date      |hour     |count|
+----------+---------+-----+
|2017-04-09|23       |1    |
|2017-04-10|0        |2    |
+----------+---------+-----+

Code (using from_utc_timestamp):

logger.info("#timezone = DefaultTz: {}, SparkUtilTz: {}", TimeZone.getDefault.getID, org.apache.spark.sql.catalyst.util.DateTimeUtils.defaultTimeZone.getID)

What I get :'(

+----------+---------+-----+
|date      |hour     |count|
+----------+---------+-----+
|2017-04-09|22       |1    |
|2017-04-09|23       |2    |
+----------+---------+-----+

Trying to wrap it with to_utc_timestamp, I get :(

+----------+---------+-----+
|tradedate |tradehour|count|
+----------+---------+-----+
|2017-04-09|20       |1    |
|2017-04-09|21       |2    |
+----------+---------+-----+
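
To see which time zone each of the three tables corresponds to, the same three epochs can be bucketed by local date and hour in plain Scala with `java.time`. This is a sketch under assumptions: `HourBuckets` and `countPerDateHour` are hypothetical names, and it does not reproduce the Spark code that generated the tables; it only shows the arithmetic.

```scala
import java.time.{Instant, ZoneId}

object HourBuckets {
  // Count epochs per (local date, local hour) in the given zone, sorted by date then hour.
  def countPerDateHour(epochs: Seq[Long], zone: String): Seq[(String, Int, Int)] =
    epochs
      .map { ts =>
        val local = Instant.ofEpochSecond(ts).atZone(ZoneId.of(zone))
        (local.toLocalDate.toString, local.getHour)
      }
      .groupBy(identity)
      .map { case ((date, hour), hits) => (date, hour, hits.size) }
      .toSeq
      .sorted

  def main(args: Array[String]): Unit = {
    val epochs = Seq(1491771599L, 1491771600L, 1491771601L)
    for (zone <- Seq("UTC", "Europe/Prague", "Europe/Helsinki"))
      println(s"$zone: ${countPerDateHour(epochs, zone).mkString(", ")}")
  }
}
```

The hours 20 and 21 are what UTC gives, 22 and 23 are what Europe/Prague (the machine default) gives, and only Europe/Helsinki yields 23 on 2017-04-09 and 0 on 2017-04-10. That suggests, though it does not prove, that the two unexpected results were rendered in UTC and in the machine's default zone rather than in the requested one.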

Do you know what the correct solution is?

Thanks in advance for your help.

Ram*_*jan 7

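Your code doesn't work for me, so I couldn't reproduce the last two outputs you got.

But I'll give you some hints on how to achieve the output you expect.

I assume you already have a dataframe like this: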

+--------------+---------------------+---------------------+
|unix_timestamp|UTC                  |Europe/Helsinki      |
+--------------+---------------------+---------------------+
|1491750899    |2017-04-09 20:59:59.0|2017-04-09 23:59:59.0|
|1491750900    |2017-04-09 21:00:00.0|2017-04-10 00:00:00.0|
|1491750901    |2017-04-09 21:00:01.0|2017-04-10 00:00:01.0|
+--------------+---------------------+---------------------+

I got this dataframe by using the code below. Once you have the dataframe above, getting the output dataframe you want requires you to split the timestamp into date and hour, then groupBy and count.

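import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DataTypes

// Wall-clock strings; Spark parses them in the default time zone of the machine
val inputDF = Seq(
      "2017-04-09 20:59:59",
      "2017-04-09 21:00:00",
      "2017-04-09 21:00:01"
    ).toDF("unix_timestamp")

val onlyTime = inputDF.select(
      // seconds since the epoch; the value depends on the default time zone
      unix_timestamp($"unix_timestamp").alias("unix_timestamp"),
      // treat the parsed timestamp as UTC and render it in the named zone
      from_utc_timestamp($"unix_timestamp".cast(DataTypes.TimestampType),  "UTC").alias("UTC"),
      from_utc_timestamp($"unix_timestamp".cast(DataTypes.TimestampType),  "Europe/Helsinki").alias("Europe/Helsinki")
    )

onlyTime.show(false)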

The resulting dataframe:

+----------+----+-----+
|date      |hour|count|
+----------+----+-----+
|2017-04-09|23  |1    |
|2017-04-10|00  |2    |
+----------+----+-----+
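
The split, groupBy and count step that leads to a table like the one above can be sketched as follows. This is a minimal sketch, not the exact code from this answer: the object `DateHourCounts` and the helper `countPerDateHour` are hypothetical names, and it assumes the chosen column holds values that cast to strings of the form `yyyy-MM-dd HH:mm:ss`.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, split}

object DateHourCounts {
  // Hypothetical helper: split a "yyyy-MM-dd HH:mm:ss" column into date and hour,
  // then count the rows in each (date, hour) bucket.
  def countPerDateHour(df: DataFrame, tsCol: String): DataFrame = {
    // Cast to string so the split also works when the column is a timestamp.
    val parts = split(col(tsCol).cast("string"), " ") // [date, time]
    df.withColumn("date", parts.getItem(0))
      .withColumn("hour", split(parts.getItem(1), ":").getItem(0))
      .groupBy("date", "hour")
      .count()
      .orderBy("date", "hour")
  }
}
```

With the dataframe built earlier, it would be called as `DateHourCounts.countPerDateHour(onlyTime, "Europe/Helsinki").show(false)`. Because the hour is taken from the string, it keeps its leading zero.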