Gra*_*non 3 week-number apache-spark pyspark pyspark-sql
I'm not sure why my code gives 52 as the answer for weekofyear("01/JAN/2017").
Does anyone have a possible explanation? Is there a better way to do this?
from pyspark.sql import SparkSession, functions
spark = SparkSession.builder.appName('weekOfYear').getOrCreate()
from pyspark.sql.functions import to_date
df = spark.createDataFrame(
[(1, "01/JAN/2017"), (2, "15/FEB/2017")], ("id", "date"))
df.show()
+---+-----------+
| id|       date|
+---+-----------+
|  1|01/JAN/2017|
|  2|15/FEB/2017|
+---+-----------+
Compute the week of the year:
df=df.withColumn("weekofyear", functions.weekofyear(to_date(df["date"],"dd/MMM/yyyy")))
df.printSchema()
root
|-- id: long (nullable = true)
|-- date: string (nullable = true)
|-- weekofyear: integer (nullable = true)
df.show()
The "wrong" result looks like this:
+---+-----------+----------+
| id|       date|weekofyear|
+---+-----------+----------+
|  1|01/JAN/2017|        52|
|  2|15/FEB/2017|         7|
+---+-----------+----------+
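It seems that weekofyear() only returns 1 for January 1st when that day falls on a Monday through Thursday.
To confirm, I created a DataFrame containing "01/JAN/YYYY" for every year from 1900 to 2018: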
# `spark` is the SparkSession created in the question (available in Spark 2.0+)
df = spark.createDataFrame(
[(1, "01/JAN/{y}".format(y=year),) for year in range(1900,2019)],
["id", "date"]
)
Now let's convert the string to a date, get the day of the week, and count the values returned by weekofyear():
import pyspark.sql.functions as f
df.withColumn("d", f.to_date(f.from_unixtime(f.unix_timestamp('date', "dd/MMM/yyyy"))))\
.withColumn("weekofyear", f.weekofyear("d"))\
.withColumn("dayofweek", f.date_format("d", "E"))\
.groupBy("dayofweek", "weekofyear")\
.count()\
.show()
#+---------+----------+-----+
#|dayofweek|weekofyear|count|
#+---------+----------+-----+
#|      Sun|        52|   17|
#|      Mon|         1|   18|
#|      Tue|         1|   17|
#|      Wed|         1|   17|
#|      Thu|         1|   17|
#|      Fri|        53|   17|
#|      Sat|        53|    4|
#|      Sat|        52|   12|
#+---------+----------+-----+
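As a cross-check that does not need a Spark cluster, the same tally can be sketched with Python's standard library, whose `date.isocalendar()` returns the ISO 8601 week number. This is only an illustration under the assumption that Spark's weekofyear() follows the same ISO rule; the names `DAY_NAMES` and `tally` are made up for this sketch.

```python
from collections import Counter
from datetime import date

# Fixed English abbreviations so the output does not depend on the locale.
DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Count (weekday, ISO week number) pairs for every January 1st, 1900 through 2018.
tally = Counter()
for year in range(1900, 2019):
    d = date(year, 1, 1)
    iso_week = d.isocalendar()[1]  # second field is the ISO week number
    tally[(DAY_NAMES[d.weekday()], iso_week)] += 1

for (weekday, week), count in sorted(tally.items()):
    print(weekday, week, count)
```

If the assumption holds, the printed groups should line up with the Spark result: week 1 for Monday through Thursday, and week 52 or 53 for Friday through Sunday.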
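Note that I am using Spark v2.1, where to_date() does not accept a format argument, so I had to use the method described in this answer to convert the string to a date.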
Similarly, to_date() only returns 1:
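Update
This behavior is consistent with the ISO 8601 definition: weeks start on Monday, and week 1 is the week that contains the year's first Thursday. January 1st therefore falls in week 1 only when it is a Monday through Thursday; otherwise it belongs to the last week (52 or 53) of the previous year. January 1, 2017 was a Sunday, hence 52.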
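On the "better way" part of the question: if January 1st should always count as week 1, one option is a week number based on the day of the year instead of the ISO calendar. The sketch below uses plain Python to compare the two for the dates in the question. `iso_week` and `simple_week` are hypothetical helper names, and the day-based rule is only one possible convention, not something Spark provides under that name.

```python
from datetime import date

def iso_week(d):
    """ISO 8601 week number: weeks start on Monday, week 1 contains the first Thursday."""
    return d.isocalendar()[1]

def simple_week(d):
    """Hypothetical alternative: count 7-day blocks starting at January 1st."""
    day_of_year = d.timetuple().tm_yday
    return (day_of_year - 1) // 7 + 1

for d in (date(2017, 1, 1), date(2017, 2, 15)):
    print(d, "iso:", iso_week(d), "simple:", simple_week(d))

# An untested assumption: a comparable Spark column expression could be built as
# floor((dayofyear(d) - 1) / 7) + 1 using pyspark.sql.functions.
```

With this rule January 1, 2017 lands in week 1 while February 15, 2017 stays in week 7. The trade-off is that these weeks no longer start on a fixed weekday.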