Haskell可以像Clang/GCC一样优化函数调用吗?

Woj*_*ilo 14 c++ optimization performance haskell llvm

我想问你Haskell和C++编译器是否可以用同样的方式优化函数调用.请看下面的代码.在以下示例中,Haskell明显快于C++.

我听说Haskell可以编译为LLVM,可以通过LLVM传递进行优化.另外我听说Haskell有一些重要的优化.但是以下示例应该能够以相同的性能工作.我想问一下:

  1. 为什么我在C++中的示例基准测试比Haskell中的基准测试慢?
  2. 是否有可能进一步优化代码?

(我使用的是LLVM-3.2和GHC-7.6).

C++代码:

#include <cstdio>
#include <cstdlib>

int b(const int x){
    return x+5;
}

int c(const int x){
    return b(x)+1;
}

int d(const int x){
    return b(x)-1;
}

int a(const int x){
    return c(x) + d(x);
}

int main(int argc, char* argv[]){
    printf("Starting...\n");
    long int iternum = atol(argv[1]);
    long long int out = 0;
    for(long int i=1; i<=iternum;i++){
        out += a(iternum-i);
    }
    printf("%lld\n",out);
    printf("Done.\n");
}
Run Code Online (Sandbox Code Playgroud)

用.编译 clang++ -O3 main.cpp

haskell代码:

module Main where
import qualified Data.Vector as V
import System.Environment
b :: Int -> Int
b x = x + 5
c x = b x + 1
d x = b x - 1
a x = c x + d x
main = do
   putStrLn "Starting..."
   args <- getArgs
   let iternum = read (head args) :: Int in do
      putStrLn $ show $ V.foldl' (+) 0 $ V.map (\i -> a (iternum-i))
         $ V.enumFromTo 1 iternum
      putStrLn "Done."
Run Code Online (Sandbox Code Playgroud)

用.编译 ghc -O3 --make -fforce-recomp -fllvm ghc-test.hs

速度结果:

在此输入图像描述

Running testcase for program 'cpp/a.out'
-------------------
cpp/a.out 100000000      0.0%   avg time: 105.05 ms 
cpp/a.out 200000000      11.11% avg time: 207.49 ms 
cpp/a.out 300000000      22.22% avg time: 309.22 ms 
cpp/a.out 400000000      33.33% avg time: 411.7 ms 
cpp/a.out 500000000      44.44% avg time: 514.07 ms 
cpp/a.out 600000000      55.56% avg time: 616.7 ms 
cpp/a.out 700000000      66.67% avg time: 718.69 ms
cpp/a.out 800000000      77.78% avg time: 821.32 ms 
cpp/a.out 900000000      88.89% avg time: 923.18 ms 
cpp/a.out 1000000000     100.0% avg time: 1025.43 ms


Running testcase for program 'hs/main'
-------------------
hs/main 100000000    0.0%   avg time: 70.97 ms (diff: 34.08)
hs/main 200000000    11.11% avg time: 138.95 ms (diff: 68.54)
hs/main 300000000    22.22% avg time: 206.3 ms (diff: 102.92)
hs/main 400000000    33.33% avg time: 274.31 ms (diff: 137.39)
hs/main 500000000    44.44% avg time: 342.34 ms (diff: 171.73)
hs/main 600000000    55.56% avg time: 410.65 ms (diff: 206.05)
hs/main 700000000    66.67% avg time: 478.25 ms (diff: 240.44)
hs/main 800000000    77.78% avg time: 546.39 ms (diff: 274.93)
hs/main 900000000    88.89% avg time: 614.12 ms (diff: 309.06)
hs/main 1000000000   100.0% avg time: 682.32 ms (diff: 343.11)
Run Code Online (Sandbox Code Playgroud)

编辑 当然,我们无法比较语言的速度,但实现的速度.

但我很好奇Ghc和C++编译器是否能以同样的方式优化函数调用

我根据你的帮助用新的基准和代码编辑了这个问题:)

sab*_*uma 21

如果你的目标是让它像C++编译器一样快速运行,那么你会希望使用编译器可以使用的数据结构.

module Main where

import qualified Data.Vector as V

b :: Int -> Int
b x = x + 5
c x = b x + 1
d x = b x - 1

a x = c x + d x

main = do
    putStrLn "Starting..."
    putStrLn $ show $ V.foldl' (+) 0 $ V.map a $ V.enumFromTo 1 100000000
    putStrLn "Done."
Run Code Online (Sandbox Code Playgroud)

GHC能够完全消除循环,只需在生成的组件中插入一个常量.在我的计算机上,当使用与最初指定的相同的优化标志时,现在它的运行时间<0.002s.

作为基于@Yuras评论的后续工作,基于矢量的解决方案和流融合解决方案产生的核心在功能上是相同的.

向量

main_$s$wfoldlM'_loop [Occ=LoopBreaker]
  :: Int# -> Int# -> Int#

main_$s$wfoldlM'_loop =
  \ (sc_s2hW :: Int#) (sc1_s2hX :: Int#) ->
    case <=# sc1_s2hX 100000000 of _ {
      False -> sc_s2hW;
      True ->
        main_$s$wfoldlM'_loop
          (+#
             sc_s2hW
             (+#
                (+# (+# sc1_s2hX 5) 1)
                (-# (+# sc1_s2hX 5) 1)))
          (+# sc1_s2hX 1)
    }
Run Code Online (Sandbox Code Playgroud)

流融合

$wloop_foldl [Occ=LoopBreaker]
  :: Int# -> Int# -> Int#

$wloop_foldl =
  \ (ww_s1Rm :: Int#) (ww1_s1Rs :: Int#) ->
    case ># ww1_s1Rs 100000000 of _ {
      False ->
        $wloop_foldl
          (+#
             ww_s1Rm
             (+#
                (+# (+# ww1_s1Rs 5) 1)
                (-# (+# ww1_s1Rs 5) 1)))
          (+# ww1_s1Rs 1);
      True -> ww_s1Rm
    }
Run Code Online (Sandbox Code Playgroud)

唯一真正的区别是终止条件的比较操作的选择.两个版本都编译为紧密的尾递归循环,可以通过LLVM轻松优化.

  • @Yuras,这可能是因为它现在已经尽可能快了.可能性在两个版本中,GHC + LLVM能够在编译时计算结果并将其插入到程序集中.如果没有LLVM,结果不会预先计算,但它们可能会产生等效的循环. (2认同)

Yur*_*ras 10

ghc不会融合列表(不惜一切代价避免成功?)

这是使用流融合包的版本:

module Main where

import Prelude hiding (map, foldl)
import Data.List.Stream
import Data.Stream (enumFromToInt, unstream)
import Text.Printf
import Control.Exception
import System.CPUTime

b :: Int -> Int
b x = x + 5
c x = b x + 1
d x = b x - 1

a x = c x + d x

main = do
    putStrLn "Starting..."
    putStrLn $ show $ foldl (+) 0 $ map (\z -> a z) $ unstream $ enumFromToInt 1 100000000 
    putStrLn "Done."
Run Code Online (Sandbox Code Playgroud)

我没有安装llvm来与你的结果进行比较,但它比你的版本快10倍(没有llvm编译).

我认为矢量融合的表现应该更快.

  • 说GHC没有融合列表并不完全正确.GHC使用[shortcut fusion](http://www.haskell.org/haskellwiki/Correctness_of_short_cut_fusion)来消除一些中间体.基于[this](http://www.haskell.org/ghc/docs/7.0.1/html/users_guide/rewrite-rules.html),我不认为`foldl`被认为是一个好消费者,不能结果融合成一个循环.流融合和载体文库显然可以融合折叠. (3认同)

asm*_*asm 7

正如其他人所指出的那样,你不是在比较等效的算法.正如Yuras所指出的那样, GHC并没有融合列表.你的Haskell版本实际上将分配整个列表,它将一次懒惰地完成一个单元格,但它将完成.下面的版本在算法上更接近您的C版本.在我的系统上,它与C版本同时运行.

{-# LANGUAGE BangPatterns #-}
module Main where

import Text.Printf
import Control.Exception
import System.CPUTime
import Data.List

a,b,c :: Int -> Int
b x = x + 5
c x = b x + 1
d x = b x - 1

a !x = c x + d x

-- Don't allocate a list, iterate and increment as the C version does.
applyTo !acc !n
    | n > 100000000 = acc
    | otherwise = applyTo (acc + a n) (n + 1)

main = do
    putStrLn "Starting..."
    print $ applyTo 0 1
    putStrLn "Done."
Run Code Online (Sandbox Code Playgroud)

比较它time:

    ghc -O3 bench.hs -fllvm -fforce-recomp -o bench-hs && time ./bench-hs
[1 of 1] Compiling Main             ( bench.hs, bench.o )
Linking bench-hs ...
Starting...
10000001100000000
Done.
./bench-hs  0.00s user 0.00s system 0% cpu 0.003 total
Run Code Online (Sandbox Code Playgroud)

与C相比:

clang++ -O3 bench.cpp -o bench && time ./bench                                   
Starting...
10000001100000000
Done.
./bench  0.00s user 0.00s system 0% cpu 0.004 total
Run Code Online (Sandbox Code Playgroud)