Haskell:对常量表达式进行不必要的重新评估

Ben*_*tic 21 performance haskell ghc

我将使用以下示例程序演示此问题

{-# LANGUAGE BangPatterns #-}

data Point = Point !Double !Double

fmod :: Double -> Double -> Double
fmod a b | a < 0     = b - fmod (abs a) b 
         | otherwise = if a < b then a 
                       else let q = a / b 
                            in b * (q - fromIntegral (floor q :: Int))

standardMap :: Double -> Point -> Point
standardMap k (Point q p) = 
   Point (fmod (q + p) (2 * pi)) (fmod (p + k * sin(q)) (2 * pi))

iterate' gen !p = p : (iterate' gen $ gen p)

main = putStrLn 
     . show 
     . (\(Point a b) -> a + b) 
     . head . drop 100000000 
     . iterate' (standardMap k) $ (Point 0.15 0.25)
    where k = (cos (pi/3)) - (sin (pi/3))
Run Code Online (Sandbox Code Playgroud)

standardMap k是参数化函数,k=(cos (pi/3))-(sin (pi/3))是一个参数.如果我编译这个程序与ghc -O3 -fllvm我的机器上的执行时间差不多42s,但是,如果我k在表单中写入0.5 - (sin (pi/3))执行时间等于21s,如果我写k = 0.5 - 0.5 * (sqrt 3)它只需要12s.

结论是k每次调用都会重新评估standardMap k.

为什么这不优化?

关于archlinux的PS编译器ghc 7.6.3

编辑

对于那些关注奇怪属性的人来说,standardMap这是一个更简单,更直观的例子,它表现出同样的问题

{-# LANGUAGE BangPatterns #-}

data Point = Point !Double !Double

rotate :: Double -> Point -> Point
rotate k (Point q p) = 
   Point ((cos k) * q - (sin k) * p) ((sin k) * q + (cos k) * p)

iterate' gen !p = p : (iterate' gen $ gen p)

main = putStrLn 
     . show 
     . (\(Point a b) -> a + b) 
     . head . drop 100000000 
     . iterate' (rotate k) $ (Point 0.15 0.25)
   where --k = (cos (pi/3)) - (sin (pi/3))
         k = 0.5 - 0.5 * (sqrt 3)
Run Code Online (Sandbox Code Playgroud)

编辑

Before I asked the question I have tried to make k strict, the same way Don suggested, but with ghc -O3 I didn't see a difference. The solution with strictness works if the program is compiled with ghc -O2. I missed that because I didn't try all possible combinations of flags with the all possible versions of the program.

So what is the difference between -O3 and -O2 that affects such cases?

Should I prefer -O2 in general?

EDIT

As observed by Mike Hartl and others, if rotate k is changed into rotate $ k or standardMap k into standardMap $ k, the performance is improved, though it is not the best possible (Don's solution). Why?

Don*_*art 16

一如既往,检查核心.

使用ghc -O2,k被内联到循环体中,它作为顶层函数浮出:

Main.main7 :: Main.Point -> Main.Point
Main.main7 =
  \ (ds_dAa :: Main.Point) ->
    case ds_dAa of _ { Main.Point q_alG p_alH ->
    case q_alG of _ { GHC.Types.D# x_s1bt ->
    case p_alH of _ { GHC.Types.D# y_s1bw ->
    case Main.$wfmod (GHC.Prim.+## x_s1bt y_s1bw) 6.283185307179586
    of ww_s1bi { __DEFAULT ->
    case Main.$wfmod
           (GHC.Prim.+##
              y_s1bw
              (GHC.Prim.*##
                 (GHC.Prim.-##
                    (GHC.Prim.cosDouble# 1.0471975511965976)
                    (GHC.Prim.sinDouble# 1.0471975511965976))
                 (GHC.Prim.sinDouble# x_s1bt)))
           6.283185307179586
    of ww1_X1bZ { __DEFAULT ->
    Main.Point (GHC.Types.D# ww_s1bi) (GHC.Types.D# ww1_X1bZ)
Run Code Online (Sandbox Code Playgroud)

指示在编译时不评估sin和cos调用.结果是会发生更多的数学运算:

$ time ./A
3.1430515093368085
real    0m15.590s
Run Code Online (Sandbox Code Playgroud)

如果你严格要求,每次至少不重新计算:

main = putStrLn
     . show
     . (\(Point a b) -> a + b)
     . head . drop 100000000
     . iterate' (standardMap k) $ (Point 0.15 0.25)

    where
      k :: Double
      !k = (cos (pi/3)) - (sin (pi/3))
Run Code Online (Sandbox Code Playgroud)

导致:

ipv_sEq =
                      GHC.Prim.-##
                        (GHC.Prim.cosDouble# 1.0471975511965976)
                        (GHC.Prim.sinDouble# 1.0471975511965976) } in
Run Code Online (Sandbox Code Playgroud)

和运行时间:

$ time ./A
6.283185307179588
real    0m7.859s
Run Code Online (Sandbox Code Playgroud)

我认为现在已经足够好了.我还将解压缩编译指示添加到Point类型.

如果您想在不同代码安排下推理数字性能,则必须检查Core.


使用您修改过的示例.它遇到了同样的问题.k内联rotate.GHC认为它真的很便宜,在这个基准测试中它更贵.

天然,ghc-7.2.3 -O2

$ time ./A
0.1470480616244365

real    0m22.897s
Run Code Online (Sandbox Code Playgroud)

k在每次调用旋转时进行评估.

k严格:一个办法,迫使它被不共享.

$ time ./A
0.14704806100839019

real    0m2.360s
Run Code Online (Sandbox Code Playgroud)

在Point构造函数上使用UNPACK编译指示:

$ time ./A
0.14704806100839019

real    0m1.860s
Run Code Online (Sandbox Code Playgroud)

  • 很好地解释了结果程序的行为,但我认为为什么`k`不共享的问题仍然是有效的. (7认同)
  • 为什么在``k``严格后结果会改变? (2认同)

Ric*_*ton 5

我不认为这是重复评估.

首先,我切换到"do"符号并在"k"的定义上使用"let",我认为应该有所帮助.不 - 仍然很慢.

然后我添加了一个跟踪调用 - 只是被评估一次.甚至检查过快速变体实际上是在制作Double.

然后我打印出两种变化.起始值存在细微差别.

调整"慢"变量的值使其速度相同.我不知道你的算法是什么 - 它对起始值是否非常敏感?

import Debug.Trace (trace)

...

main = do
    -- is -0.3660254037844386
    let k0 = (0.5 - 0.5 * (sqrt 3))::Double
    -- was -0.3660254037844385
    let k1 = (cos (pi/3)) - (trace "x" (sin (pi/3))) + 0.0000000000000001;
    putStrLn (show k0)
    putStrLn (show k1)
    putStrLn
     . show
     . (\(Point a b) -> a + b)
     . head . drop 100000000
     . iterate' (standardMap k1) $ (Point 0.15 0.25)
Run Code Online (Sandbox Code Playgroud)

编辑:这是带有数字文字的版本.它为我显示了23秒和7秒的运行时间.我编译了两个单独版本的代码,以确保我没有做一些愚蠢的事情,比如不重新编译.

main = do
    -- -0.3660254037844386
    -- -0.3660254037844385
    let k2 = -0.3660254037844385
    putStrLn
     . show
     . (\(Point a b) -> a + b)
     . head . drop 100000000
     . iterate' (standardMap k2) $ (Point 0.15 0.25)
Run Code Online (Sandbox Code Playgroud)

编辑2:我不知道如何从ghc获取操作码,但比较两个.o文件的hexdumps显示它们相差一个字节 - 可能是文字.所以它不能是运行时.


编辑3:尝试转向剖析,这让我更加困惑.除非我遗漏了一些东西,否则唯一的区别就是调用次数fmod(fmod.q准确)的差异很小.

"5"轮廓用于常数结束"5",与"6"相同.

        Fri Sep  6 12:37 2013 Time and Allocation Profiling Report  (Final)

           constant-timings-5 +RTS -p -RTS

        total time  =       38.34 secs   (38343 ticks @ 1000 us, 1 processor)
        total alloc = 12,000,105,184 bytes  (excludes profiling overheads)

COST CENTRE MODULE  %time %alloc

standardMap Main     71.0    0.0
iterate'    Main     21.2   93.3
fmod        Main      6.3    6.7


                                                          individual     inherited
COST CENTRE     MODULE                  no.     entries  %time %alloc   %time %alloc

MAIN            MAIN                     50           0    0.0    0.0   100.0  100.0                                  
 main           Main                    101           0    0.0    0.0     0.0    0.0                                  
 CAF:main1      Main                     98           0    0.0    0.0     0.0    0.0                                  
  main          Main                    100           1    0.0    0.0     0.0    0.0                                  
 CAF:main2      Main                     97           0    0.0    0.0     1.0    0.0                                  
  main          Main                    102           0    1.0    0.0     1.0    0.0                                  
   main.\       Main                    110           1    0.0    0.0     0.0    0.0                                  
 CAF:main3      Main                     96           0    0.0    0.0    99.0  100.0                                  
  main          Main                    103           0    0.0    0.0    99.0  100.0                                  
   iterate'     Main                    104   100000001   21.2   93.3    99.0  100.0                                  
    standardMap Main                    105   100000000   71.0    0.0    77.9    6.7                                  
     fmod       Main                    106   200000001    6.3    6.7     6.9    6.7                                                                                                
      fmod.q    Main                    109    49999750    0.6    0.0     0.6    0.0                                                                                                
 CAF:main_k     Main                     95           0    0.0    0.0     0.0    0.0                                                                                                
  main          Main                    107           0    0.0    0.0     0.0    0.0                                                                                                
   main.k2      Main                    108           1    0.0    0.0     0.0    0.0                                                                                                
 CAF            GHC.IO.Handle.FD         93           0    0.0    0.0     0.0    0.0                                                                                                
 CAF            GHC.Conc.Signal          90           0    0.0    0.0     0.0    0.0                                                                                                
 CAF            GHC.Float                89           0    0.0    0.0     0.0    0.0                                                                                                
 CAF            GHC.IO.Encoding          82           0    0.0    0.0     0.0    0.0                                                                                                
 CAF            GHC.IO.Encoding.Iconv    66           0    0.0    0.0     0.0    0.0 


        Fri Sep  6 12:38 2013 Time and Allocation Profiling Report  (Final)

           constant-timings-6 +RTS -p -RTS

        total time  =       22.17 secs   (22167 ticks @ 1000 us, 1 processor)
        total alloc = 11,999,947,752 bytes  (excludes profiling overheads)

COST CENTRE MODULE  %time %alloc

standardMap Main     48.4    0.0
iterate'    Main     38.2   93.3
fmod        Main     10.9    6.7
main        Main      1.4    0.0
fmod.q      Main      1.0    0.0


                                                          individual     inherited
COST CENTRE     MODULE                  no.     entries  %time %alloc   %time %alloc

MAIN            MAIN                     50           0    0.0    0.0   100.0  100.0
 main           Main                    101           0    0.0    0.0     0.0    0.0
 CAF:main1      Main                     98           0    0.0    0.0     0.0    0.0
  main          Main                    100           1    0.0    0.0     0.0    0.0
 CAF:main2      Main                     97           0    0.0    0.0     1.4    0.0
  main          Main                    102           0    1.4    0.0     1.4    0.0
   main.\       Main                    110           1    0.0    0.0     0.0    0.0
 CAF:main3      Main                     96           0    0.0    0.0    98.6  100.0
  main          Main                    103           0    0.0    0.0    98.6  100.0
   iterate'     Main                    104   100000001   38.2   93.3    98.6  100.0
    standardMap Main                    105   100000000   48.4    0.0    60.4    6.7
     fmod       Main                    106   200000001   10.9    6.7    12.0    6.7
      fmod.q    Main                    109    49989901    1.0    0.0     1.0    0.0
 CAF:main_k     Main                     95           0    0.0    0.0     0.0    0.0
  main          Main                    107           0    0.0    0.0     0.0    0.0
   main.k2      Main                    108           1    0.0    0.0     0.0    0.0
 CAF            GHC.IO.Handle.FD         93           0    0.0    0.0     0.0    0.0
 CAF            GHC.Conc.Signal          90           0    0.0    0.0     0.0    0.0
 CAF            GHC.Float                89           0    0.0    0.0     0.0    0.0
 CAF            GHC.IO.Encoding          82           0    0.0    0.0     0.0    0.0
 CAF            GHC.IO.Encoding.Iconv    66           0    0.0    0.0     0.0    0.0
Run Code Online (Sandbox Code Playgroud)

EDIT4:下面的链接是两个操作码转储(感谢@Tom Ellis).虽然我看不懂它们,但它们似乎有相同的"形状".据推测,长随机字符串是内部标识符.我刚刚重新编译-O2 -fforce-recomp,时间差异是真实的. https://gist.github.com/anonymous/6462797