How can I process dates quickly in pandas?

Gh*_*KU 1 python datetime dataframe pandas

I have a DataFrame with 200,000 rows. Each record has a timestamp, and I need to group the records by date. So I do this:

In [67]: df['result_date'][0]
Out[67]: Timestamp('2017-09-01 09:12:00')

In [68]: %timeit df['result_day'] = df['result_date'].apply(lambda x: str(x.date()))
2.26 s ± 73.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [69]: df['result_day'][0]
Out[69]: '2017-09-01'

Or:

In [70]: %timeit df['result_day'] = df['result_date'].apply(lambda x: x.date())
2.05 s ± 213 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [71]: df['result_day'][0]
Out[71]: datetime.date(2017, 9, 1)

Either way, it takes about 2 seconds. Can I do it faster?

UPD:

In [75]: df.shape
Out[75]: (228217, 18)

In [77]: %timeit df['result_date'].dt.date
1.44 s ± 42.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Jef*_*eff 5

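Using jezrael's example. You almost never want to actually use .date; it creates Python objects. .normalize() sets the time component to 00:00:00, effectively turning the values into dates while keeping them in the high-performance datetime64[ns] format.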

In [32]: rng = pd.date_range('2000-04-03', periods=200000, freq='2H')
    ...: df = pd.DataFrame({'result_date': rng})  
    ...: 

In [33]: %timeit df['result_date'].dt.date
482 ms ± 10.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [34]: %timeit df['result_date'].dt.normalize()
16.3 ms ± 234 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
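To make the difference concrete, here is a minimal sketch on a tiny hand-made frame. The sample timestamps and the variable names as_objects and as_midnight are illustrative assumptions, not part of the original benchmark. It only inspects dtypes, which is probably where most of the speed gap comes from.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Illustrative data (assumed): three timestamps across two calendar days.
df = pd.DataFrame({
    'result_date': pd.to_datetime([
        '2017-09-01 09:12:00',
        '2017-09-01 17:30:00',
        '2017-09-02 08:00:00',
    ])
})

# .dt.date builds one Python datetime.date object per row (object dtype).
as_objects = df['result_date'].dt.date

# .dt.normalize() keeps the native datetime dtype and only zeroes the time part.
as_midnight = df['result_date'].dt.normalize()

print(as_objects.dtype)                       # object column of datetime.date
print(is_datetime64_any_dtype(as_midnight))   # still a datetime64 column
print(as_midnight.tolist())
```

If the day is needed as a string only for display, it can be formatted at the end rather than stored as the grouping key.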

Grouping:

In [39]: %timeit df.groupby(df['result_date'].dt.date).size()
506 ms ± 11.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [40]: %timeit df.groupby(df['result_date'].dt.normalize()).size()
24.2 ms ± 1.92 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Or, idiomatically:

In [38]: %timeit df.resample('D', on='result_date').size()
5.47 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
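One thing to keep in mind when choosing between the two grouping approaches: they are not exactly interchangeable when the data has gaps. The sketch below uses made-up timestamps (an assumption for illustration) that skip one calendar day. Grouping on the normalized column returns only days that occur in the data, while resample('D') returns a continuous daily index and reports a count of 0 for the missing day.

```python
import pandas as pd

# Illustrative data (assumed): no records on 2017-09-02.
df = pd.DataFrame({
    'result_date': pd.to_datetime([
        '2017-09-01 09:12:00',
        '2017-09-01 17:30:00',
        '2017-09-03 08:00:00',
    ])
})

# Only the days present in the data appear in the result.
by_normalize = df.groupby(df['result_date'].dt.normalize()).size()

# Every day between the first and last timestamp appears, empty days included.
by_resample = df.resample('D', on='result_date').size()

print(by_normalize)
print(by_resample)

# Dropping the empty bins makes the two results line up.
print(by_resample[by_resample > 0])
```

Which behavior is preferable depends on whether empty days should show up in the output, for example in a daily report or a plot.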