Gre*_*ver 5 python performance datetime numpy pandas
我似乎在pandas.Timestamp与python常规datetime()对象上遇到意外缓慢的算术运算性能.
以下是一个基准测试:
import datetime
import pandas
import numpy
# using datetime:
def test1():
d1 = datetime.datetime(2015, 3, 20, 10, 0, 0)
d2 = datetime.datetime(2015, 3, 20, 10, 0, 15)
delta = datetime.timedelta(minutes=30)
count = 0
for i in range(500000):
if d2 - d1 > delta:
count += 1
# using pandas:
def test2():
d1 = pandas.datetime(2015, 3, 20, 10, 0, 0)
d2 = pandas.datetime(2015, 3, 20, 10, 0, 15)
delta = pandas.Timedelta(minutes=30)
count = 0
for i in range(500000):
if d2 - d1 > delta:
count += 1
# using numpy
def test3():
d1 = numpy.datetime64('2015-03-20 10:00:00')
d2 = numpy.datetime64('2015-03-20 10:00:15')
delta = numpy.timedelta64(30, 'm')
count = 0
for i in range(500000):
if d2 - d1 > delta:
count += 1
time1 = datetime.datetime.now()
test1()
time2 = datetime.datetime.now()
test2()
time3 = datetime.datetime.now()
test3()
time4 = datetime.datetime.now()
print('DELTA test1: ' + str(time2-time1))
print('DELTA test2: ' + str(time3-time2))
print('DELTA test3: ' + str(time4-time3))
Run Code Online (Sandbox Code Playgroud)
我机器上的相应结果(python3.3,pandas 0.15.2):
DELTA test1: 0:00:00.131698
DELTA test2: 0:00:10.034970
DELTA test3: 0:00:05.233389
Run Code Online (Sandbox Code Playgroud)
这是预期的吗?
除了尽可能将代码切换到Python的默认日期时间实现之外,还有其他方法可以消除性能问题吗?
我在我的机器上得到了类似的结果:
$ python -mtimeit -s "from datetime import datetime, timedelta; d1, d2 = datetime(2015, 3, 20, 10, 0, 0), datetime(2015, 3, 20, 10, 0, 15); delta = timedelta(minutes=30)" "(d2 - d1) > delta"
10000000 loops, best of 3: 0.107 usec per loop
$ python -mtimeit -s "from numpy import datetime64, timedelta64; d1, d2 = datetime64('2015-03-20T10:00:00Z'), datetime64('2015-03-20T10:00:15Z'); delta = timedelta64(30, 'm')" "(d2 - d1) > delta"
100000 loops, best of 3: 5.35 usec per loop
$ python -mtimeit -s "from pandas import Timestamp, Timedelta; d1, d2 = Timestamp('2015-03-20T10:00:00Z'), Timestamp('2015-03-20T10:00:15Z'); delta = Timedelta(minutes=30)" "(d2 - d1) > delta"
10000 loops, best of 3: 19.9 usec per loop
Run Code Online (Sandbox Code Playgroud)
datetime
numpy
比相应的类似物快几倍pandas
。
$ python -c "import numpy, pandas; print(numpy.__version__, pandas.__version__)"
('1.9.2', '0.15.2')
Run Code Online (Sandbox Code Playgroud)
目前尚不清楚为什么差异如此之大。确实numpy
,pandas
代码针对矢量化操作进行了优化。但是,为什么这些特定的标量操作慢两个数量级并不明显,例如,添加显式时区不会减慢datetime.datetime
代码速度:
$ python3 -mtimeit -s "from datetime import datetime, timedelta, timezone; d1, d2 = datetime(2015, 3, 20, 10, 0, 0, tzinfo=timezone.utc), datetime(2015, 3, 20, 10, 0, 15, tzinfo=timezone.utc); delta = timedelta(minutes=30)" "(d2 - d1) > delta"
10000000 loops, best of 3: 0.0939 usec per loop
Run Code Online (Sandbox Code Playgroud)
要解决此问题,如果无法使用矢量化运算,您可以尝试将本机日期/时间类型全部转换为更简单(更快)的类似物(例如,表示为浮点数的 POSIX 时间戳)。