pea*_*tus 1 sql apache-spark apache-spark-sql
我正在尝试将两个时间戳之间的分钟差异转换为MM/dd/yyyy hh:mm:ss AM/PM. 我刚开始使用 SparkSQL 并尝试使用datediff其他 SQL 语法支持的基本函数 Ie datediff(minute,start_time,end_time),但这产生了错误:
org.apache.spark.sql.AnalysisException: cannot resolve '`minute`' given input columns: [taxisub.tpep_dropoff_datetime, taxisub.DOLocationID, taxisub.improvement_surcharge, taxisub.VendorID, taxisub.trip_distance, taxisub.tip_amount, taxisub.tolls_amount, taxisub.payment_type, taxisub.fare_amount, taxisub.tpep_pickup_datetime, taxisub.total_amount, taxisub.store_and_fwd_flag, taxisub.extra, taxisub.passenger_count, taxisub.PULocationID, taxisub.mta_tax, taxisub.RatecodeID]; line 1 pos 153;
Run Code Online (Sandbox Code Playgroud)
似乎minutesparkSQL 的 datediff 不支持该参数。我目前的查询是:
spark.sqlContext.sql("Select to_timestamp(tpep_pickup_datetime,'MM/dd/yyyy hh:mm:ss') as pickup,to_timestamp(tpep_dropoff_datetime,'MM/dd/yyyy hh:mm:ss') as dropoff, datediff(to_timestamp(tpep_pickup_datetime,'MM/dd/yyyy hh:mm:ss'),to_timestamp(tpep_dropoff_datetime,'MM/dd/yyyy hh:mm:ss')) as diff from taxisub ").show()
Run Code Online (Sandbox Code Playgroud)
我的结果是:
+-------------------+-------------------+----+
| pickup| dropoff|diff|
+-------------------+-------------------+----+
|2018-12-15 08:53:20|2018-12-15 08:57:57| 0|
|2018-12-15 08:03:08|2018-12-15 08:07:30| 0|
|2018-12-15 08:28:34|2018-12-15 08:33:31| 0|
|2018-12-15 08:37:53|2018-12-15 08:43:47| 0|
|2018-12-15 08:51:02|2018-12-15 08:55:54| 0|
|2018-12-15 08:03:47|2018-12-15 08:03:50| 0|
|2018-12-15 08:45:21|2018-12-15 08:57:08| 0|
|2018-12-15 08:04:47|2018-12-15 08:29:05| 0|
|2018-12-15 08:01:22|2018-12-15 08:12:15| 0|
+-------------------+-------------------+----+
Run Code Online (Sandbox Code Playgroud)
datediff鉴于结果中的 0,我假设默认值是天数差异。我应该使用其他参数/函数来确定这两个时间戳之间的分钟差异吗?
提前致谢。
在 Spark sql 中有两种方法可以做到这一点。您将时间戳列转换为 bigint,然后减去并除以 60 是否可以直接转换为 unix_timestamp 然后减去并除以 60 以获得结果。我使用了上面数据框中的拾取和丢弃列。(在 pyspark/scala spark 中,bigint 很长)
spark.sqlContext.sql("""select pickup, dropoff, (unix_timestamp(dropoff)-unix_timestamp(pickup))/(60) as diff from taxisub""").show()
Run Code Online (Sandbox Code Playgroud)
spark.sqlContext.sql("""select pickup, dropoff, ((bigint(to_timestamp(dropoff)))-(bigint(to_timestamp(pickup))))/(60) as diff from taxisub""").show()
Run Code Online (Sandbox Code Playgroud)
输出:
+-------------------+-------------------+------------------+
| pickup| dropoff| diff|
+-------------------+-------------------+------------------+
|2018-12-15 08:53:20|2018-12-15 08:57:57| 4.616666666666666|
|2018-12-15 08:03:08|2018-12-15 08:07:30| 4.366666666666666|
|2018-12-15 08:28:34|2018-12-15 08:33:31| 4.95|
|2018-12-15 08:37:53|2018-12-15 08:43:47| 5.9|
|2018-12-15 08:51:02|2018-12-15 08:55:54| 4.866666666666666|
|2018-12-15 08:03:47|2018-12-15 08:03:50| 0.05|
|2018-12-15 08:45:21|2018-12-15 08:57:08|11.783333333333333|
|2018-12-15 08:04:47|2018-12-15 08:29:05| 24.3|
|2018-12-15 08:01:22|2018-12-15 08:12:15|10.883333333333333|
+-------------------+-------------------+------------------+
Run Code Online (Sandbox Code Playgroud)