How to convert a date to the first day of the month in a PySpark DataFrame column?

Question

Rak*_*van 4 python apache-spark apache-spark-sql pyspark

I have the following DataFrame:

+----------+
|      date|
+----------+
|2017-11-25|
|2017-12-21|
|2017-09-12|
+----------+

Here is the code that creates the DataFrame above:

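import pyspark.sql.functions as f

# sc and sqlContext are the SparkContext and SQLContext that the PySpark shell provides
rdd = sc.parallelize([("2017/11/25",), ("2017/12/21",), ("2017/09/12",)])
# Parse the "yyyy/MM/dd" strings into a proper date column
df = sqlContext.createDataFrame(rdd, ["date"]).withColumn("date", f.to_date(f.col("date"), "yyyy/MM/dd"))
df.show()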

I want a new column holding the first day of the month for each row, that is, the same date with the day replaced by "01":

+----------+----------+
|      date|first_date|
+----------+----------+
|2017-11-25|2017-11-01|
|2017-12-21|2017-12-01|
|2017-09-12|2017-09-01|
+----------+----------+

pyspark.sql.functions has a last_day function, but no first_day function.

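I tried to do this with date_sub, but it does not work: I get a "Column is not iterable" error, because the second argument of date_sub must be an integer and cannot be a Column.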

f.date_sub(f.col('date'), f.dayofmonth(f.col('date')) - 1 )

Answer 1

hi-*_*zir 12

You can use trunc:

df.withColumn("first_date", f.trunc("date", "month")).show()

+----------+----------+
|      date|first_date|
+----------+----------+
|2017-11-25|2017-11-01|
|2017-12-21|2017-12-01|
|2017-09-12|2017-09-01|
+----------+----------+
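
As a quick sanity check outside Spark, the per-row result shown above can be mirrored in plain Python. This is only an illustrative sketch: the helper names first_day_of_month and first_day_by_subtraction are hypothetical and not part of PySpark, and the code relies solely on the standard datetime module. The second helper follows the arithmetic the question attempted with date_sub (subtract the day of the month minus one), which suggests the idea itself is sound and only the API call rejected it.

```python
from datetime import date, timedelta


def first_day_of_month(d: date) -> date:
    """Return the first day of d's month by replacing the day with 1."""
    return d.replace(day=1)


def first_day_by_subtraction(d: date) -> date:
    """Same result via the question's idea: subtract (day of month - 1) days."""
    return d - timedelta(days=d.day - 1)


# Sample dates taken from the question's DataFrame
dates = [date(2017, 11, 25), date(2017, 12, 21), date(2017, 9, 12)]

for d in dates:
    # Print each date next to the first day of its month
    print(d.isoformat(), first_day_of_month(d).isoformat())
```

Both helpers should agree for every input date, so either one can serve as a reference when checking the first_date column on a small sample.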
