Ben*_*tic 21 performance haskell ghc
我将使用以下示例程序演示此问题
{-# LANGUAGE BangPatterns #-}
data Point = Point !Double !Double
fmod :: Double -> Double -> Double
fmod a b | a < 0 = b - fmod (abs a) b
| otherwise = if a < b then a
else let q = a / b
in b * (q - fromIntegral (floor q :: Int))
standardMap :: Double -> Point -> Point
standardMap k (Point q p) =
Point (fmod (q + p) (2 * pi)) (fmod (p + k * sin(q)) (2 * pi))
iterate' gen !p = p : (iterate' gen $ gen p)
main = putStrLn
. show
. (\(Point a b) -> a + b)
. head . drop 100000000
. iterate' (standardMap k) $ (Point 0.15 0.25)
where k = (cos (pi/3)) - (sin (pi/3))
Run Code Online (Sandbox Code Playgroud)
这standardMap k是参数化函数,k=(cos (pi/3))-(sin (pi/3))是一个参数.如果我编译这个程序与ghc -O3 -fllvm我的机器上的执行时间差不多42s,但是,如果我k在表单中写入0.5 - (sin (pi/3))执行时间等于21s,如果我写k = 0.5 - 0.5 * (sqrt 3)它只需要12s.
结论是k每次调用都会重新评估standardMap k.
为什么这不优化?
关于archlinux的PS编译器ghc 7.6.3
编辑
对于那些关注奇怪属性的人来说,standardMap这是一个更简单,更直观的例子,它表现出同样的问题
{-# LANGUAGE BangPatterns #-}
data Point = Point !Double !Double
rotate :: Double -> Point -> Point
rotate k (Point q p) =
Point ((cos k) * q - (sin k) * p) ((sin k) * q + (cos k) * p)
iterate' gen !p = p : (iterate' gen $ gen p)
main = putStrLn
. show
. (\(Point a b) -> a + b)
. head . drop 100000000
. iterate' (rotate k) $ (Point 0.15 0.25)
where --k = (cos (pi/3)) - (sin (pi/3))
k = 0.5 - 0.5 * (sqrt 3)
Run Code Online (Sandbox Code Playgroud)
编辑
Before I asked the question I have tried to make k strict, the same way Don suggested, but with ghc -O3 I didn't see a difference. The solution with strictness works if the program is compiled with ghc -O2. I missed that because I didn't try all possible combinations of flags with the all possible versions of the program.
So what is the difference between -O3 and -O2 that affects such cases?
Should I prefer -O2 in general?
EDIT
As observed by Mike Hartl and others, if rotate k is changed into rotate $ k or standardMap k into standardMap $ k, the performance is improved, though it is not the best possible (Don's solution). Why?
Don*_*art 16
一如既往,检查核心.
使用ghc -O2,k被内联到循环体中,它作为顶层函数浮出:
Main.main7 :: Main.Point -> Main.Point
Main.main7 =
\ (ds_dAa :: Main.Point) ->
case ds_dAa of _ { Main.Point q_alG p_alH ->
case q_alG of _ { GHC.Types.D# x_s1bt ->
case p_alH of _ { GHC.Types.D# y_s1bw ->
case Main.$wfmod (GHC.Prim.+## x_s1bt y_s1bw) 6.283185307179586
of ww_s1bi { __DEFAULT ->
case Main.$wfmod
(GHC.Prim.+##
y_s1bw
(GHC.Prim.*##
(GHC.Prim.-##
(GHC.Prim.cosDouble# 1.0471975511965976)
(GHC.Prim.sinDouble# 1.0471975511965976))
(GHC.Prim.sinDouble# x_s1bt)))
6.283185307179586
of ww1_X1bZ { __DEFAULT ->
Main.Point (GHC.Types.D# ww_s1bi) (GHC.Types.D# ww1_X1bZ)
Run Code Online (Sandbox Code Playgroud)
指示在编译时不评估sin和cos调用.结果是会发生更多的数学运算:
$ time ./A
3.1430515093368085
real 0m15.590s
Run Code Online (Sandbox Code Playgroud)
如果你严格要求,每次至少不重新计算:
main = putStrLn
. show
. (\(Point a b) -> a + b)
. head . drop 100000000
. iterate' (standardMap k) $ (Point 0.15 0.25)
where
k :: Double
!k = (cos (pi/3)) - (sin (pi/3))
Run Code Online (Sandbox Code Playgroud)
导致:
ipv_sEq =
GHC.Prim.-##
(GHC.Prim.cosDouble# 1.0471975511965976)
(GHC.Prim.sinDouble# 1.0471975511965976) } in
Run Code Online (Sandbox Code Playgroud)
和运行时间:
$ time ./A
6.283185307179588
real 0m7.859s
Run Code Online (Sandbox Code Playgroud)
我认为现在已经足够好了.我还将解压缩编译指示添加到Point类型.
如果您想在不同代码安排下推理数字性能,则必须检查Core.
使用您修改过的示例.它遇到了同样的问题.k内联rotate.GHC认为它真的很便宜,在这个基准测试中它更贵.
天然,ghc-7.2.3 -O2
$ time ./A
0.1470480616244365
real 0m22.897s
Run Code Online (Sandbox Code Playgroud)
并k在每次调用旋转时进行评估.
请k严格:一个办法,迫使它被不共享.
$ time ./A
0.14704806100839019
real 0m2.360s
Run Code Online (Sandbox Code Playgroud)
在Point构造函数上使用UNPACK编译指示:
$ time ./A
0.14704806100839019
real 0m1.860s
Run Code Online (Sandbox Code Playgroud)
我不认为这是重复评估.
首先,我切换到"do"符号并在"k"的定义上使用"let",我认为应该有所帮助.不 - 仍然很慢.
然后我添加了一个跟踪调用 - 只是被评估一次.甚至检查过快速变体实际上是在制作Double.
然后我打印出两种变化.起始值存在细微差别.
调整"慢"变量的值使其速度相同.我不知道你的算法是什么 - 它对起始值是否非常敏感?
import Debug.Trace (trace)
...
main = do
-- is -0.3660254037844386
let k0 = (0.5 - 0.5 * (sqrt 3))::Double
-- was -0.3660254037844385
let k1 = (cos (pi/3)) - (trace "x" (sin (pi/3))) + 0.0000000000000001;
putStrLn (show k0)
putStrLn (show k1)
putStrLn
. show
. (\(Point a b) -> a + b)
. head . drop 100000000
. iterate' (standardMap k1) $ (Point 0.15 0.25)
Run Code Online (Sandbox Code Playgroud)
编辑:这是带有数字文字的版本.它为我显示了23秒和7秒的运行时间.我编译了两个单独版本的代码,以确保我没有做一些愚蠢的事情,比如不重新编译.
main = do
-- -0.3660254037844386
-- -0.3660254037844385
let k2 = -0.3660254037844385
putStrLn
. show
. (\(Point a b) -> a + b)
. head . drop 100000000
. iterate' (standardMap k2) $ (Point 0.15 0.25)
Run Code Online (Sandbox Code Playgroud)
编辑2:我不知道如何从ghc获取操作码,但比较两个.o文件的hexdumps显示它们相差一个字节 - 可能是文字.所以它不能是运行时.
编辑3:尝试转向剖析,这让我更加困惑.除非我遗漏了一些东西,否则唯一的区别就是调用次数fmod(fmod.q准确)的差异很小.
"5"轮廓用于常数结束"5",与"6"相同.
Fri Sep 6 12:37 2013 Time and Allocation Profiling Report (Final)
constant-timings-5 +RTS -p -RTS
total time = 38.34 secs (38343 ticks @ 1000 us, 1 processor)
total alloc = 12,000,105,184 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
standardMap Main 71.0 0.0
iterate' Main 21.2 93.3
fmod Main 6.3 6.7
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 50 0 0.0 0.0 100.0 100.0
main Main 101 0 0.0 0.0 0.0 0.0
CAF:main1 Main 98 0 0.0 0.0 0.0 0.0
main Main 100 1 0.0 0.0 0.0 0.0
CAF:main2 Main 97 0 0.0 0.0 1.0 0.0
main Main 102 0 1.0 0.0 1.0 0.0
main.\ Main 110 1 0.0 0.0 0.0 0.0
CAF:main3 Main 96 0 0.0 0.0 99.0 100.0
main Main 103 0 0.0 0.0 99.0 100.0
iterate' Main 104 100000001 21.2 93.3 99.0 100.0
standardMap Main 105 100000000 71.0 0.0 77.9 6.7
fmod Main 106 200000001 6.3 6.7 6.9 6.7
fmod.q Main 109 49999750 0.6 0.0 0.6 0.0
CAF:main_k Main 95 0 0.0 0.0 0.0 0.0
main Main 107 0 0.0 0.0 0.0 0.0
main.k2 Main 108 1 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 93 0 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 90 0 0.0 0.0 0.0 0.0
CAF GHC.Float 89 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding 82 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 66 0 0.0 0.0 0.0 0.0
Fri Sep 6 12:38 2013 Time and Allocation Profiling Report (Final)
constant-timings-6 +RTS -p -RTS
total time = 22.17 secs (22167 ticks @ 1000 us, 1 processor)
total alloc = 11,999,947,752 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
standardMap Main 48.4 0.0
iterate' Main 38.2 93.3
fmod Main 10.9 6.7
main Main 1.4 0.0
fmod.q Main 1.0 0.0
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 50 0 0.0 0.0 100.0 100.0
main Main 101 0 0.0 0.0 0.0 0.0
CAF:main1 Main 98 0 0.0 0.0 0.0 0.0
main Main 100 1 0.0 0.0 0.0 0.0
CAF:main2 Main 97 0 0.0 0.0 1.4 0.0
main Main 102 0 1.4 0.0 1.4 0.0
main.\ Main 110 1 0.0 0.0 0.0 0.0
CAF:main3 Main 96 0 0.0 0.0 98.6 100.0
main Main 103 0 0.0 0.0 98.6 100.0
iterate' Main 104 100000001 38.2 93.3 98.6 100.0
standardMap Main 105 100000000 48.4 0.0 60.4 6.7
fmod Main 106 200000001 10.9 6.7 12.0 6.7
fmod.q Main 109 49989901 1.0 0.0 1.0 0.0
CAF:main_k Main 95 0 0.0 0.0 0.0 0.0
main Main 107 0 0.0 0.0 0.0 0.0
main.k2 Main 108 1 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 93 0 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 90 0 0.0 0.0 0.0 0.0
CAF GHC.Float 89 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding 82 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 66 0 0.0 0.0 0.0 0.0
Run Code Online (Sandbox Code Playgroud)
EDIT4:下面的链接是两个操作码转储(感谢@Tom Ellis).虽然我看不懂它们,但它们似乎有相同的"形状".据推测,长随机字符串是内部标识符.我刚刚重新编译-O2 -fforce-recomp,时间差异是真实的.
https://gist.github.com/anonymous/6462797