Interpreting timestamp fields in Spark while reading JSON

Question

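I am trying to read a pretty-printed JSON file that contains time fields. I want the timestamp columns to be interpreted as timestamp fields while the JSON itself is being read. However, they are still read as strings when I call printSchema.

For example, the input JSON file: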

[{
    "time_field" : "2017-09-30 04:53:39.412496Z"
}]

Code:

df = spark.read.option("multiLine", "true").option("timestampFormat","yyyy-MM-dd HH:mm:ss.SSSSSS'Z'").json('path_to_json_file')

Output of df.printSchema():

root
 |-- time_field: string (nullable = true)

What am I missing here?

Answer 1

Leo*_*o C 3

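My own experience with the timestampFormat option is that it doesn't quite work as advertised. I would simply read the time fields as strings and use to_timestamp to do the conversion, as shown below (with slightly generalized sample input):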

# /path/to/jsonfile
[{
    "id": 101, "time_field": "2017-09-30 04:53:39.412496Z"
},
{
    "id": 102, "time_field": "2017-10-01 01:23:45.123456Z"
}]

In Python:

from pyspark.sql.functions import to_timestamp

df = spark.read.option("multiLine", "true").json("/path/to/jsonfile")

df = df.withColumn("timestamp", to_timestamp("time_field"))

df.show(2, False)
+---+---------------------------+-------------------+
|id |time_field                 |timestamp          |
+---+---------------------------+-------------------+
|101|2017-09-30 04:53:39.412496Z|2017-09-30 04:53:39|
|102|2017-10-01 01:23:45.123456Z|2017-10-01 01:23:45|
+---+---------------------------+-------------------+

df.printSchema()
root
 |-- id: long (nullable = true)
 |-- time_field: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)

In Scala:

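import org.apache.spark.sql.functions.to_timestamp
// The $"..." column syntax needs spark.implicits._ (already imported in spark-shell)

val df = spark.read.option("multiLine", "true").json("/path/to/jsonfile")

df.withColumn("timestamp", to_timestamp($"time_field"))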

Asked:	7 years, 2 months ago
Viewed:	4119 times
Last active:	5 years, 11 months ago