How to extract the year and week number in a DataFrame and put them in a new column in Python

che*_*ens 2 python pandas

I have the following DataFrame:

sale_id      created_at
1               2016-05-28T05:53:31.042Z
2               2016-05-30T12:50:58.184Z
3               2016-05-23T10:22:18.858Z
4               2016-05-27T09:20:15.158Z
5               2016-05-21T08:30:17.337Z
6               2016-05-28T07:41:14.361Z

I need to add a year_week column that contains the year and week number of each row's created_at value:

sale_id      created_at                      year_week
1               2016-05-28T05:53:31.042Z       2016-21
2               2016-05-30T12:50:58.184Z       2016-22
3               2016-05-23T10:22:18.858Z       2016-21
4               2016-05-27T09:20:15.158Z       2016-21
5               2016-05-21T08:30:17.337Z       2016-20
6               2016-05-28T07:41:14.361Z       2016-21

I would prefer a solution that can be easily ported to PySpark.

Max*_*axU 5

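UPDATE: PySpark DF solution:

from pyspark.sql.functions import date_format

# withColumn expects a Column expression, so pass date_format(...) directly
# Note: Spark 3.0+ may reject week-based patterns such as 'w'; weekofyear() is an alternative there
df = df.withColumn('year_week', date_format('created_at', 'yyyy-w'))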

Old pandas solution:

Try this:

df['year_week'] = df.created_at.dt.year.astype(str) + '-' + df.created_at.dt.weekofyear.astype(str)

In [29]: df
Out[29]:
   sale_id              created_at year_week
0        1 2016-05-28 05:53:31.042   2016-21
1        2 2016-05-30 12:50:58.184   2016-22
2        3 2016-05-23 10:22:18.858   2016-21
3        4 2016-05-27 09:20:15.158   2016-21
4        5 2016-05-21 08:30:17.337   2016-20
5        6 2016-05-28 07:41:14.361   2016-21
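In recent pandas versions the `.dt.weekofyear` and `.dt.week` accessors are no longer available (they were deprecated in pandas 1.1 and removed in 2.0). A minimal sketch of the same idea using `.dt.isocalendar()`, assuming pandas 1.1 or later and rebuilding the sample data from the question:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "sale_id": [1, 2, 3, 4, 5, 6],
    "created_at": [
        "2016-05-28T05:53:31.042Z",
        "2016-05-30T12:50:58.184Z",
        "2016-05-23T10:22:18.858Z",
        "2016-05-27T09:20:15.158Z",
        "2016-05-21T08:30:17.337Z",
        "2016-05-28T07:41:14.361Z",
    ],
})

# Parse the ISO 8601 strings; utc=True interprets the trailing "Z" as UTC.
df["created_at"] = pd.to_datetime(df["created_at"], utc=True)

# isocalendar() returns a DataFrame with "year", "week" and "day" columns.
iso = df["created_at"].dt.isocalendar()
df["year_week"] = iso["year"].astype(str) + "-" + iso["week"].astype(str)

print(df)
```

Note that `iso["year"]` is the ISO week-numbering year, which can differ from `dt.year` for dates in the first or last days of a calendar year.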

Timing against a 600,000-row DF:

In [33]: df = pd.concat([df] * 10**5, ignore_index=True)

In [34]: %timeit df.created_at.dt.strftime('%Y-%U')
1 loop, best of 3: 16.1 s per loop

In [35]: %timeit df.created_at.dt.year.astype(str) + '-' + df.created_at.dt.weekofyear.astype(str)
1 loop, best of 3: 7.43 s per loop

In [43]: %timeit df.created_at.dt.year.astype(str) + '-' + df.created_at.dt.week.astype(str)
1 loop, best of 3: 7.45 s per loop

In [36]: df.shape
Out[36]: (600000, 2)


jez*_*ael 5

You can use strftime:

See Python's strftime directives.

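# convert first if the dtype is not already datetime
df.created_at = pd.to_datetime(df.created_at)

# %Y = four-digit year, %U = week of the year with Sunday as the first day
df['year_week'] = df.created_at.dt.strftime('%Y-%U')
print(df)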
   sale_id              created_at year_week
0        1 2016-05-28 05:53:31.042   2016-21
1        2 2016-05-30 12:50:58.184   2016-22
2        3 2016-05-23 10:22:18.858   2016-21
3        4 2016-05-27 09:20:15.158   2016-21
4        5 2016-05-21 08:30:17.337   2016-20
5        6 2016-05-28 07:41:14.361   2016-21
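The week directive matters near year boundaries: `%U` counts weeks starting on Sunday, `%W` counts weeks starting on Monday, and both label the days before the first such weekday as week 00, whereas ISO 8601 assigns those days to the last week of the previous year. The sketch below uses only the standard library to compare the three; `week_labels` is a hypothetical helper name, not part of any library.

```python
from datetime import date


def week_labels(d):
    """Return the year-week label of a date under three numbering schemes."""
    iso_year, iso_week, _ = d.isocalendar()
    return {
        "sunday_first": d.strftime("%Y-%U"),    # weeks start on Sunday
        "monday_first": d.strftime("%Y-%W"),    # weeks start on Monday
        "iso": f"{iso_year}-{iso_week:02d}",    # ISO 8601 year and week
    }


for d in (date(2016, 5, 28), date(2016, 1, 1), date(2017, 1, 1)):
    print(d.isoformat(), week_labels(d))
```

For the dates in the question all three schemes agree, which is why the `strftime('%Y-%U')` and `dt.week` answers produce the same output here. They diverge for dates such as 2016-01-01 (`2016-00` versus ISO `2015-53`).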

Another solution, using dt.year and dt.week:

# parentheses allow the expression to span two lines
df['year_week'] = (df.created_at.dt.year.astype(str) + '-' +
                   df.created_at.dt.week.astype(str))
print(df)
   sale_id              created_at year_week
0        1 2016-05-28 05:53:31.042   2016-21
1        2 2016-05-30 12:50:58.184   2016-22
2        3 2016-05-23 10:22:18.858   2016-21
3        4 2016-05-27 09:20:15.158   2016-21
4        5 2016-05-21 08:30:17.337   2016-20
5        6 2016-05-28 07:41:14.361   2016-21
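One difference between the two approaches: `strftime` zero-pads the week number, while concatenating `astype(str)` values does not, so single-digit weeks come out as `2016-1` rather than `2016-01`. If the labels will be sorted as strings, padding may be worth adding. A small sketch, assuming pandas 1.1 or later and using made-up dates for illustration:

```python
import pandas as pd

# Illustrative dates (not from the question) that fall in ISO weeks 1, 9 and 21.
dates = pd.Series(pd.to_datetime(["2016-01-05", "2016-03-01", "2016-05-28"]))
iso = dates.dt.isocalendar()

year = iso["year"].astype(str)
week = iso["week"].astype(str)

unpadded = year + "-" + week
padded = year + "-" + week.str.zfill(2)  # pad the week to two digits

print(sorted(unpadded))  # string order does not match chronological order
print(sorted(padded))    # string order matches chronological order
```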