alb*_*lcs 5 timezone apache-spark apache-spark-sql
I need help, because I seem to be lost in time zones :)
I am using Spark 1.6.2.
I have epochs like these:
+--------------+-------------------+-------------------+
|unix_timestamp|UTC                |Europe/Helsinki    |
+--------------+-------------------+-------------------+
|1491771599    |2017-04-09 20:59:59|2017-04-09 23:59:59|
|1491771600    |2017-04-09 21:00:00|2017-04-10 00:00:00|
|1491771601    |2017-04-09 21:00:01|2017-04-10 00:00:01|
+--------------+-------------------+-------------------+
The default time zone on the Spark machines is:
#timezone = DefaultTz: Europe/Prague, SparkUtilTz: Europe/Prague
which is the output of:
logger.info("#timezone = DefaultTz: {}, SparkUtilTz: {}", TimeZone.getDefault.getID, org.apache.spark.sql.catalyst.util.DateTimeUtils.defaultTimeZone.getID)
I want to count the timestamps grouped by date and hour in a given time zone (currently Europe/Helsinki, which is UTC+3 hours).
What I expect:
+----------+---------+-----+
|date      |hour     |count|
+----------+---------+-----+
|2017-04-09|23       |1    |
|2017-04-10|0        |2    |
+----------+---------+-----+
What I got with code using from_utc_timestamp :'(
+----------+---------+-----+
|date      |hour     |count|
+----------+---------+-----+
|2017-04-09|22       |1    |
|2017-04-09|23       |2    |
+----------+---------+-----+
When I tried wrapping it with to_utc_timestamp, I got :(
+----------+---------+-----+
|tradedate |tradehour|count|
+----------+---------+-----+
|2017-04-09|20       |1    |
|2017-04-09|21       |2    |
+----------+---------+-----+
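Do you know what the correct solution would be?
Thanks in advance for your help.
Your code does not work for me, so I could not reproduce the last two outputs you got.
But I will give you some hints on how to achieve the expected output.
I assume you already have a dataframe like this: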
+--------------+---------------------+---------------------+
|unix_timestamp|UTC                  |Europe/Helsinki      |
+--------------+---------------------+---------------------+
|1491750899    |2017-04-09 20:59:59.0|2017-04-09 23:59:59.0|
|1491750900    |2017-04-09 21:00:00.0|2017-04-10 00:00:00.0|
|1491750901    |2017-04-09 21:00:01.0|2017-04-10 00:00:01.0|
+--------------+---------------------+---------------------+
I got this dataframe by using the following code:
import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DataTypes

// Sample input: the three UTC wall-clock times from the question, as strings
val inputDF = Seq(
  "2017-04-09 20:59:59",
  "2017-04-09 21:00:00",
  "2017-04-09 21:00:01"
).toDF("unix_timestamp")

val onlyTime = inputDF.select(
  // Epoch seconds; the string is parsed in the JVM's default time zone
  unix_timestamp($"unix_timestamp").alias("unix_timestamp"),
  // Cast the string to a timestamp, then shift it from UTC to the target zone
  from_utc_timestamp($"unix_timestamp".cast(DataTypes.TimestampType), "UTC").alias("UTC"),
  from_utc_timestamp($"unix_timestamp".cast(DataTypes.TimestampType), "Europe/Helsinki").alias("Europe/Helsinki")
)
onlyTime.show(false)
Once you have the above dataframe, getting your desired output dataframe requires you to split the date, groupBy and count.
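The split, groupBy and count step could look roughly like the sketch below. It is an illustration rather than the exact code behind the result: it assumes the Spark 1.6 Scala API, the names `sc`, `sqlContext`, `helsinkiDF`, `parts` and `counts` are made up for the example, and it starts from the Europe/Helsinki wall-clock values as plain strings so that the outcome does not depend on the JVM's default time zone.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._

// Assumption: a local context for the example; getOrCreate reuses an existing one (e.g. in spark-shell)
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[1]").setAppName("count-per-date-hour"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Assumption: the Europe/Helsinki column from the dataframe above, written out as strings
val helsinkiDF = Seq(
  "2017-04-09 23:59:59",
  "2017-04-10 00:00:00",
  "2017-04-10 00:00:01"
).toDF("Europe/Helsinki")

// "2017-04-09 23:59:59" -> Array("2017-04-09", "23:59:59")
val parts = split($"Europe/Helsinki".cast("string"), " ")

val counts = helsinkiDF
  .select(
    parts.getItem(0).alias("date"),                   // date part
    substring(parts.getItem(1), 1, 2).alias("hour")   // first two characters of the time part
  )
  .groupBy("date", "hour")
  .count()                                            // adds a "count" column
  .orderBy("date", "hour")

counts.show(false)
```

Because the hour is cut out of the string, it stays a two-character string such as "00". If you need an integer, cast the column with `.cast("int")`.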
The resulting dataframe is:
+----------+----+-----+
|date      |hour|count|
+----------+----+-----+
|2017-04-09|23  |1    |
|2017-04-10|00  |2    |
+----------+----+-----+
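If you want to cross-check the expected buckets without Spark, and without the JVM's default time zone (Europe/Prague on your machines) getting in the way, a minimal sketch with java.time might help. It assumes Java 8 or newer; `epochs`, `zone` and `countsByDateHour` are illustrative names, and the epoch values are the ones from your table.

```scala
import java.time.{Instant, ZoneId}

// Epoch seconds taken from the question's table
val epochs = Seq(1491771599L, 1491771600L, 1491771601L)

// The target zone is passed explicitly, so the JVM default zone is never consulted
val zone = ZoneId.of("Europe/Helsinki")

// Bucket each instant by its (local date, local hour) in the target zone
val countsByDateHour: Map[(String, Int), Int] = epochs
  .map { seconds =>
    val local = Instant.ofEpochSecond(seconds).atZone(zone)
    (local.toLocalDate.toString, local.getHour)
  }
  .groupBy(identity)
  .map { case (key, hits) => key -> hits.size }

// Print one "date hour count" line per bucket, sorted by date and hour
countsByDateHour.toSeq.sorted.foreach { case ((date, hour), n) =>
  println(s"$date $hour $n")
}
```

This gives one timestamp in the 2017-04-09 hour 23 bucket and two in the 2017-04-10 hour 0 bucket, so any Spark result that differs from that is shifted by a time zone offset somewhere along the way.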