这是我之前的一个问题的后续问题,我在其中询问为什么流融合不是在某个程序中踢.结果证明问题在于某些函数没有内联,并且一个INLINE标志通过about改进了性能17x(这表明了内联的重要性!).
现在,请注意,在最初的问题上,我立刻硬编码64了incAll.现在,假设我创建了一个nTimes函数,它重复调用一个函数:
module Main where
import qualified Data.Vector.Unboxed as V
{-# INLINE incAll #-}
incAll :: V.Vector Int -> V.Vector Int
incAll = V.map (+ 1)
{-# INLINE nTimes #-}
nTimes :: Int -> (a -> a) -> a -> a
nTimes 0 f x = x
nTimes n f x = f (nTimes (n-1) f x)
main :: IO ()
main = do
let size = 100000000 :: Int
let array = V.replicate size 0 :: V.Vector Int
print $ V.sum (nTimes 64 incAll array)
Run Code Online (Sandbox Code Playgroud)
在这种情况下,只添加一个INLINEpragma nTimes将无济于事,因为AFAIK GHC不会内联递归函数.是否有任何技巧迫使GHC nTimes在编译时扩展,从而恢复预期的性能?
Zet*_*eta 27
不,但你可以使用更好的功能.我不是在谈论V.map (+64),这会让事情变得更快,但是nTimes.我们有三个候选人已经做了以下事情nTimes:
{-# INLINE nTimesFoldr #-}
nTimesFoldr :: Int -> (a -> a) -> a -> a
nTimesFoldr n f x = foldr (.) id (replicate n f) $ x
{-# INLINE nTimesIterate #-}
nTimesIterate :: Int -> (a -> a) -> a -> a
nTimesIterate n f x = iterate f x !! n
{-# INLINE nTimesTail #-}
nTimesTail :: Int -> (a -> a) -> a -> a
nTimesTail n f = go n
where
{-# INLINE go #-}
go n x | n <= 0 = x
go n x = go (n - 1) (f x)
Run Code Online (Sandbox Code Playgroud)
所有版本大约需要8秒,而版本需要40秒.顺便说一句,约阿希姆的版本也需要8秒.请注意,该iterate版本在我的系统上占用更多内存.虽然GHC 有一个unroll插件,但它在过去五年中没有更新(它使用自定义ANNotations).
然而,在我们绝望之前,GHC实际上是如何试图内联一切的?让我们使用nTimesTail和nTimes 1:
module Main where
import qualified Data.Vector.Unboxed as V
{-# INLINE incAll #-}
incAll :: V.Vector Int -> V.Vector Int
incAll = V.map (+ 1)
{-# INLINE nTimes #-}
nTimes :: Int -> (a -> a) -> a -> a
nTimes n f = go n
where
{-# INLINE go #-}
go n x | n <= 0 = x
go n x = go (n - 1) (f x)
main :: IO ()
main = do
let size = 100000000 :: Int
let array = V.replicate size 0 :: V.Vector Int
print $ V.sum (nTimes 1 incAll array)
Run Code Online (Sandbox Code Playgroud)
$ stack ghc --package vector -- -O2 -ddump-simpl -dsuppress-all SO.hs
Run Code Online (Sandbox Code Playgroud)
main2 =
case (runSTRep main3) `cast` ...
of _ { Vector ww1_s9vw ww2_s9vx ww3_s9vy ->
case $wgo 1 ww1_s9vw ww2_s9vx ww3_s9vy
of _ { (# ww5_s9w3, ww6_s9w4, ww7_s9w5 #) ->
Run Code Online (Sandbox Code Playgroud)
我们可以在那里停下来.$wgo是go上面定义的.即使1GHC没有展开循环.这是令人不安的,因为它1是一个常数.
但是,唉,这并非一无所获.如果C++程序员能够为编译时常量执行以下操作,那么我们也应该这样做吧?
template <int N>
struct Call{
template <class F, class T>
static T call(F f, T && t){
return f(Call<N-1>::call(f,std::forward<T>(t)));
}
};
template <>
struct Call<0>{
template <class F, class T>
static T call(F f, T && t){
return t;
}
};
Run Code Online (Sandbox Code Playgroud)
当然,我们可以使用TemplateHaskell *:
-- Times.sh
{-# LANGUAGE TemplateHaskell #-}
module Times where
import Control.Monad (when)
import Language.Haskell.TH
nTimesTH :: Int -> Q Exp
nTimesTH n = do
f <- newName "f"
x <- newName "x"
when (n <= 0) (reportWarning "nTimesTH: argument non-positive")
let go k | k <= 0 = VarE x
go k = AppE (VarE f) (go (k - 1))
return $ LamE [VarP f,VarP x] (go n)
Run Code Online (Sandbox Code Playgroud)
怎么nTimesTH办?它创建了一个新函数,其中第一个名称f将x总共应用于第二个名称n.n现在需要是一个适合我们的编译时常量,因为只有编译时常量才能进行循环展开:
$(nTimesTH 0) = \f x -> x
$(nTimesTH 1) = \f x -> f x
$(nTimesTH 2) = \f x -> f (f x)
$(nTimesTH 3) = \f x -> f (f (f x))
...
Run Code Online (Sandbox Code Playgroud)
它有用吗?它快吗?比较快多快nTimes?让我们尝试另一个main:
-- SO.hs
{-# LANGUAGE TemplateHaskell #-}
module Main where
import Times
import qualified Data.Vector.Unboxed as V
{-# INLINE incAll #-}
incAll :: V.Vector Int -> V.Vector Int
incAll = V.map (+ 1)
{-# INLINE nTimes #-}
nTimes :: Int -> (a -> a) -> a -> a
nTimes n f = go n
where
{-# INLINE go #-}
go n x | n <= 0 = x
go n x = go (n - 1) (f x)
main :: IO ()
main = do
let size = 100000000 :: Int
let array = V.replicate size 0 :: V.Vector Int
let vTH = V.sum ($(nTimesTH 64) incAll array)
let vNorm = V.sum (nTimes 64 incAll array)
print $ vTH == vNorm
Run Code Online (Sandbox Code Playgroud)
stack ghc --package vector -- -O2 SO.hs && SO.exe +RTS -t
Run Code Online (Sandbox Code Playgroud)
True
<<ghc: 52000056768 bytes, 66 GCs, 400034700/800026736 avg/max bytes residency (2 samples), 1527M in use, 0.000 INIT (0.000 elapsed), 8.875 MUT (9.119 elapsed), 0.000 GC (0.094 elapsed) :ghc>>
Run Code Online (Sandbox Code Playgroud)
它产生了正确的结果.它有多快?让我们main再次使用另一个:
main :: IO ()
main = do
let size = 100000000 :: Int
let array = V.replicate size 0 :: V.Vector Int
print $ V.sum ($(nTimesTH 64) incAll array)
Run Code Online (Sandbox Code Playgroud)
800,048,112 bytes allocated in the heap
4,352 bytes copied during GC
42,664 bytes maximum residency (1 sample(s))
18,776 bytes maximum slop
764 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1 colls, 0 par 0.000s 0.000s 0.0000s 0.0000s
Gen 1 1 colls, 0 par 0.000s 0.049s 0.0488s 0.0488s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.172s ( 0.221s elapsed)
GC time 0.000s ( 0.049s elapsed)
EXIT time 0.000s ( 0.049s elapsed)
Total time 0.188s ( 0.319s elapsed)
%GC time 0.0% (15.3% elapsed)
Alloc rate 4,654,825,378 bytes per MUT second
Productivity 100.0% of total user, 58.7% of total elapsed
Run Code Online (Sandbox Code Playgroud)
那么,将它与8s进行比较.所以对于TL; DR:如果你有编译时常量,并且你想根据这些常量创建和/或修改你的代码,可以考虑使用Template Haskell.
*请注意,这是我写过的第一个模板Haskell代码.小心使用.不要使用太大n,否则最终可能会出现混乱的功能.
Mat*_*ing 15
Andres之前告诉我一个鲜为人知的技巧,你可以通过使用类型类实际获得GHC到内联递归函数.
我们的想法是,而不是通常在一个值上执行结构递归的函数.您可以使用类型类定义函数,并对类型参数执行结构递归.在此示例中,类型级自然数.
GHC将很乐意内联每个递归调用并生成有效的代码,因为每个递归调用都是不同的类型.
我没有对此进行基准测试或查看核心,但它明显更快.
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE PolyKinds #-}
{-# LANGUAGE ScopedTypeVariables #-}
module Main where
import qualified Data.Vector.Unboxed as V
data Proxy a = Proxy
{-# INLINE incAll #-}
incAll :: V.Vector Int -> V.Vector Int
incAll = V.map (+ 1)
oldNTimes :: Int -> (a -> a) -> a -> a
oldNTimes 0 f x = x
oldNTimes n f x = f (oldNTimes (n-1) f x)
-- New definition
data N = Z | S N
class Unroll (n :: N) where
nTimes :: Proxy n -> (a -> a) -> a -> a
instance Unroll Z where
nTimes _ f x = x
instance Unroll n => Unroll (S n) where
nTimes p f x =
let Proxy :: Proxy (S n) = p
in f (nTimes (Proxy :: Proxy n) f x)
main :: IO ()
main = do
let size = 100000000 :: Int
let array = V.replicate size 0 :: V.Vector Int
print $ V.sum (nTimes (Proxy :: Proxy (S (S (S (S (S (S (S (S (S (S (S Z)))))))))))) incAll array)
print $ V.sum (oldNTimes 11 incAll array)
Run Code Online (Sandbox Code Playgroud)