Sta*_*tto 34
python · performance · subtraction · addition
I'm optimizing some Python code, and I tried the following experiment:
import time

start = time.clock()
x = 0
for i in range(10000000):
    x += 1
end = time.clock()
print '+=', end - start

start = time.clock()
x = 0
for i in range(10000000):
    x -= -1
end = time.clock()
print '-=', end - start
The second loop is reliably faster, anywhere from a whisker to 10%, depending on the system I run it on. I've tried changing the order of the loops, the number of executions, and so on, and the effect still holds.

Stranger still,
for i in range(10000000, 0, -1):
(i.e. running the loop backwards) is faster than
for i in range(10000000):
even though the loop contents are identical.

What gives, and is there a more general programming lesson here?
Gle*_*ard 77
I can reproduce this on my Q6600 (Python 2.6.2); increasing the range to 100000000:
('+=', 11.370000000000001)
('-=', 10.769999999999998)
First, some observations:
The bytecode for the two loops differs only in INPLACE_ADD vs. INPLACE_SUBTRACT and +1 vs. -1.

Looking at the Python source, I can make a guess. This is handled in ceval.c, in PyEval_EvalFrameEx. INPLACE_ADD has a significant extra block of code, to handle string concatenation. That block doesn't exist in INPLACE_SUBTRACT, since you can't subtract strings. That means INPLACE_ADD contains more native code. Depending (heavily!) on how the code is being generated by the compiler, this extra code may be inline with the rest of the INPLACE_ADD code, which means additions can hit the instruction cache harder than subtraction. This could be causing extra L2 cache misses, which could cause a significant performance difference.
This is heavily dependent on the system you're on (different processors have different amounts of cache and cache architectures), the compiler in use, including the particular version and compilation options (different compilers will decide differently which bits of code are on the critical path, which determines how assembly code is lumped together), and so on.
Also, the difference is reversed in Python 3.0.1 (+: 15.66, -: 16.71); no doubt this critical function has changed a lot.
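You can inspect the bytecode difference mentioned above yourself with the dis module. A minimal sketch (the wrapper functions are mine, added only so there is something to disassemble):

import dis

def add(x):
    x += 1
    return x

def sub(x):
    x -= -1
    return x

dis.dis(add)  # shows LOAD_CONST 1 followed by INPLACE_ADD
dis.dis(sub)  # shows LOAD_CONST -1 followed by INPLACE_SUBTRACT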
pix*_*eat 13
$ python -m timeit -s "x=0" "x+=1"
10000000 loops, best of 3: 0.151 usec per loop
$ python -m timeit -s "x=0" "x-=-1"
10000000 loops, best of 3: 0.154 usec per loop
Looks like you have some measurement bias.
I think the "general programming lesson" is that it is really hard to predict, just by looking at the source code, which sequence of statements will be the fastest. Programmers at every level frequently get caught out by this kind of "intuitive" optimization. What you think you know may not necessarily be true.

There is simply no substitute for actually measuring your program's performance. Kudos for doing so; answering why undoubtedly requires digging deep into the implementation of Python.

With byte-compiled languages such as Java, Python, and .NET, it is not even enough to measure performance on a single machine. Differences between VM versions, native code translation implementations, CPU-specific optimizations, and so on will make this sort of question ever trickier to answer.
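If you would rather measure in-process than from the shell, the timeit module's repeat() guards against exactly this kind of bias: it times each statement several times, and taking the minimum filters out interference from the rest of the system. A minimal sketch (the repeat and loop counts are arbitrary):

import timeit

# Time each statement 5 times; min() is the least-disturbed run.
add_times = timeit.repeat(stmt="x += 1", setup="x = 0", repeat=5, number=10000000)
sub_times = timeit.repeat(stmt="x -= -1", setup="x = 0", repeat=5, number=10000000)
print '+=', min(add_times)
print '-=', min(sub_times)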
"第二个循环可靠得快......"
那就是你的解释.重新排序脚本,以便先进行减法测试,然后添加,然后突然添加成为更快的操作:
-= 3.05
+= 2.84
Clearly, something happens in the second half of the script that makes it faster. My guess is that the first call to range() is slower because Python needs to allocate enough memory for such a long list, but it is able to reuse that memory for the second call to range():
import time

start = time.clock()
x = range(10000000)
end = time.clock()
del x  # free the list so its memory can be reused
print 'first range()', end - start

start = time.clock()
x = range(10000000)
end = time.clock()
print 'second range()', end - start
A few runs of this script show that the extra time needed by the first range() accounts for nearly all of the time difference between '+=' and '-=' seen above:
first range() 0.4
second range() 0.23
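One way to take the list allocation out of the picture entirely on Python 2 is xrange, which yields the loop indices lazily instead of building a ten-million-element list up front. A minimal sketch along the lines of the original test:

import time

start = time.clock()
x = 0
for i in xrange(10000000):  # xrange builds no list, so neither loop
    x += 1                  # pays a one-off allocation cost
end = time.clock()
print '+= with xrange', end - start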