Pou*_*del 3 python sql pandas apache-spark pyspark
I have a column named Time containing float values that give the time in seconds since the first event. I would like to know how, in SQL, to create two new columns called Date and Hour from this column.
My dataset is large, so I cannot use Pandas.
import numpy as np
import pandas as pd
import pyspark
from pyspark.sql.functions import col
from pyspark.sql.functions import udf
from pyspark.sql import functions as F
spark = pyspark.sql.SparkSession.builder.appName('bhishan').getOrCreate()
%%bash
cat > data.csv << EOL
Time
10.0
61.0
3500.00
3600.00
3700.54
7000.22
7200.22
15000.55
86400.22
EOL
df = spark.read.csv('data.csv', header=True, inferSchema=True)
print('nrows = ', df.count(), 'ncols = ', len(df.columns))
df.show()
nrows = 9 ncols = 1
+--------+
| Time|
+--------+
| 10.0|
| 61.0|
| 3500.0|
| 3600.0|
| 3700.54|
| 7000.22|
| 7200.22|
|15000.55|
|86400.22|
+--------+
pandas_df = df.toPandas()
pandas_df['Date'] = pd.to_datetime('2019-01-01') + pd.to_timedelta(pandas_df['Time'],unit='s')
pandas_df['hour'] = pandas_df['Date'].dt.hour
print(pandas_df)
Time Date hour
0 10.00 2019-01-01 00:00:10.000 0
1 61.00 2019-01-01 00:01:01.000 0
2 3500.00 2019-01-01 00:58:20.000 0
3 3600.00 2019-01-01 01:00:00.000 1
4 3700.54 2019-01-01 01:01:40.540 1
5 7000.22 2019-01-01 01:56:40.220 1
6 7200.22 2019-01-01 02:00:00.220 2
7 15000.55 2019-01-01 04:10:00.550 4
8 86400.22 2019-01-02 00:00:00.220 0
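The arithmetic behind the Pandas version is just an offset from a fixed base timestamp. A minimal pure-Python sketch of the same conversion (the base date 2019-01-01 is the same assumption made above):

```python
from datetime import datetime, timedelta

base = datetime(2019, 1, 1)  # assumed timestamp of the first event

def to_date_and_hour(seconds):
    """Add the float offset in seconds to the base timestamp."""
    ts = base + timedelta(seconds=seconds)
    return ts, ts.hour

print(to_date_and_hour(3700.54))   # crosses into hour 1
print(to_date_and_hour(86400.22))  # crosses into the next day
```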
How can I get the new columns Date and Hour using SQL with PySpark, just as I did above in Pandas? My data is too big for Pandas, which is why I have to use PySpark. Thanks.
You can use the functions timestamp, unix_timestamp and hour:
from pyspark.sql.functions import expr, hour
df.withColumn('Date', expr("timestamp(unix_timestamp('2019-01-01 00:00:00') + Time)")) \
  .withColumn('hour', hour('Date')) \
  .show(truncate=False)
+--------+----------------------+----+
|Time |Date |hour|
+--------+----------------------+----+
|10.0 |2019-01-01 00:00:10 |0 |
|61.0 |2019-01-01 00:01:01 |0 |
|3500.0 |2019-01-01 00:58:20 |0 |
|3600.0 |2019-01-01 01:00:00 |1 |
|3700.54 |2019-01-01 01:01:40.54|1 |
|7000.22 |2019-01-01 01:56:40.22|1 |
|7200.22 |2019-01-01 02:00:00.22|2 |
|15000.55|2019-01-01 04:10:00.55|4 |
|86400.22|2019-01-02 00:00:00.22|0 |
+--------+----------------------+----+
Note: the timestamp function is used here (rather than formatting to a string) so that the fractional seconds are preserved.
Using SQL syntax:
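To illustrate why the fractional part matters, here is a plain-Python sketch (not Spark) of the difference between adding the float offset and then casting, versus truncating to whole seconds first, as a string formatter such as from_unixtime would. The base date matches the example above:

```python
from datetime import datetime, timezone

# Epoch seconds of the assumed base date 2019-01-01 (UTC used for determinism)
base = datetime(2019, 1, 1, tzinfo=timezone.utc).timestamp()

# Adding the float offset before casting keeps the fraction
kept = datetime.fromtimestamp(base + 7200.22, tz=timezone.utc)

# Truncating to whole seconds first drops it
lost = datetime.fromtimestamp(int(base + 7200.22), tz=timezone.utc)

print(kept)  # fractional seconds retained
print(lost)  # fractional seconds gone
```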
df.createOrReplaceTempView('t_df')
spark.sql("""
WITH d AS (SELECT *, timestamp(unix_timestamp('2019-01-01 00:00:00') + Time) as Date FROM t_df)
SELECT *, hour(d.Date) AS hour FROM d
""").show(truncate=False)