ema*_*max 15 python sql pyspark
I have a table like the following:
df
+------------------------------------+-----------------------+
|identifier |timestamp |
+------------------------------------+-----------------------+
|86311425-0890-40a5-8950-54cbaaa60815|2020-03-18 14:41:55 UTC|
|38e121a8-f21f-4d10-bb69-26eb045175b5|2020-03-13 15:19:21 UTC|
|1a69c9b0-283b-4b6d-89ac-66f987280c66|2020-03-16 12:59:51 UTC|
|c7b5c53f-bf40-498f-8302-4b3329322bc9|2020-03-18 22:05:06 UTC|
|0d3d807b-9b3a-466e-907c-c22402240730|2020-03-17 18:40:03 UTC|
+------------------------------------+-----------------------+
tmp.printSchema()
root
|-- identifier: string (nullable = true)
|-- timestamp: string (nullable = true)
I would like a column that contains only the date and the hour from the timestamp.
I am trying the following:
from pyspark.sql.functions import col, hour
df = df.withColumn("hour", hour(col("timestamp")))
But I get the following result:
+--------------------+--------------------+----+
| identifier| timestamp|hour|
+--------------------+--------------------+----+
|321869c3-71e5-41d...|2020-03-19 03:34:...|null|
|226b8d50-2c6a-471...|2020-03-19 02:59:...|null|
|47818b7c-34b5-43c...|2020-03-19 01:41:...|null|
|f5ca5599-7252-49d...|2020-03-19 04:25:...|null|
|add2ae24-aa7b-4d3...|2020-03-19 01:50:...|null|
+--------------------+--------------------+----+
whereas I would like to have:
+--------------------+--------------------+-------------------+
| identifier| timestamp|hour |
+--------------------+--------------------+-------------------+
|321869c3-71e5-41d...|2020-03-19 03:00:...|2020-03-19 03:00:00|
|226b8d50-2c6a-471...|2020-03-19 02:59:...|2020-03-19 02:00:00|
|47818b7c-34b5-43c...|2020-03-19 01:41:...|2020-03-19 01:00:00|
|f5ca5599-7252-49d...|2020-03-19 04:25:...|2020-03-19 04:00:00|
|add2ae24-aa7b-4d3...|2020-03-19 01:50:...|2020-03-19 01:00:00|
+--------------------+--------------------+-------------------+
mur*_*ash 13
You should use the built-in PySpark function date_trunc to truncate to the hour. You can also truncate to day/month/year, etc.
from pyspark.sql import functions as F
df.withColumn("hour", F.date_trunc('hour',F.to_timestamp("timestamp","yyyy-MM-dd HH:mm:ss 'UTC'")))\
.show(truncate=False)
+------------------------------------+-----------------------+-------------------+
|identifier |timestamp |hour |
+------------------------------------+-----------------------+-------------------+
|86311425-0890-40a5-8950-54cbaaa60815|2020-03-18 14:41:55 UTC|2020-03-18 14:00:00|
|38e121a8-f21f-4d10-bb69-26eb045175b5|2020-03-13 15:19:21 UTC|2020-03-13 15:00:00|
|1a69c9b0-283b-4b6d-89ac-66f987280c66|2020-03-16 12:59:51 UTC|2020-03-16 12:00:00|
|c7b5c53f-bf40-498f-8302-4b3329322bc9|2020-03-18 22:05:06 UTC|2020-03-18 22:00:00|
|0d3d807b-9b3a-466e-907c-c22402240730|2020-03-17 18:40:03 UTC|2020-03-17 18:00:00|
+------------------------------------+-----------------------+-------------------+
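You asked for both the date and the hour. You can use the functions PySpark provides to extract just the date and the hour, as follows.
It takes 3 steps: convert the column to a timestamp, extract the date, and extract the hour.
The code looks like this: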
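from pyspark.sql.functions import to_timestamp, dayofmonth, hour
# Step 1: parse the string column into a timestamp; the quoted 'UTC' matches the literal suffix
df = df.withColumn("timestamp", to_timestamp("timestamp", "yyyy-MM-dd HH:mm:ss 'UTC'"))
# Step 2 & 3: Extract the needed information
# dayofmonth gives the day number (e.g. 19); use to_date instead for the full calendar date
df = df.withColumn('Date', dayofmonth(df.timestamp))
df = df.withColumn('Hour', hour(df.timestamp))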
# Display the result
df.show(1, False)
#+----------+--------------------+-------------------+-------------------+
#|identifier| timestamp| Date| Hour|
#+----------+--------------------+-------------------+-------------------+
#| 1|2020-03-19 03:00:...| 19| 03|
#+----------+--------------------+-------------------+-------------------+
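The Hour column does not look exactly like what you described, because notNull's answer above already covers that. This is an alternative approach if, for example, you only want the date and the hour so that you can group or aggregate on them later.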
Use the from_unixtime and unix_timestamp functions, because hour only extracts the hour value from a timestamp (or) string (yyyy-MM-dd HH:mm:ss) type.
from pyspark.sql.functions import col, from_unixtime, unix_timestamp
#sample data
df.show(truncate=False)
#+----------+-----------------------+
#|identifier|timestamp |
#+----------+-----------------------+
#|1 |2020-03-18 14:41:55 UTC|
#+----------+-----------------------+
#DataFrame[identifier: string, timestamp: string]
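# HH is the 24-hour field (hh is the 12-hour one); the quoted 'UTC' matches the literal suffix
df.withColumn("hour", from_unixtime(unix_timestamp(col("timestamp"),"yyyy-MM-dd HH:mm:ss 'UTC'"),"yyyy-MM-dd HH:00:00")).show()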
#+----------+--------------------+-------------------+
#|identifier| timestamp| hour|
#+----------+--------------------+-------------------+
#| 1|2020-03-18 14:41:...|2020-03-18 14:00:00|
#+----------+--------------------+-------------------+
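The parse-then-format idea can be sanity-checked without a cluster. Below is a plain-Python sketch using only the standard library. It is an analogy rather than Spark code: %H and %I in strftime play roughly the role that HH (24-hour) and hh (12-hour) play in Spark's datetime patterns. The variable names are illustrative.

```python
from datetime import datetime

raw = "2020-03-18 14:41:55 UTC"

# Parse with the literal " UTC" suffix included in the pattern.
parsed = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S UTC")

# Format back with minutes and seconds hard-coded to zero.
hour_24 = parsed.strftime("%Y-%m-%d %H:00:00")  # 24-hour field
hour_12 = parsed.strftime("%Y-%m-%d %I:00:00")  # 12-hour field, AM/PM is lost

print(hour_24)  # 2020-03-18 14:00:00
print(hour_12)  # 2020-03-18 02:00:00
```

The second line of output shows why the 24-hour field matters for afternoon values: with a 12-hour field and no AM/PM marker, 14:00 and 02:00 become indistinguishable.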
Usage of hour function:
#on string type
spark.sql("select hour('2020-03-04 12:34:34')").show()
#on timestamp type
spark.sql("select hour(timestamp('2020-03-04 12:34:34'))").show()
#+---+
#|_c0|
#+---+
#| 12|
#+---+