Why does adding INLINE make my program slower?

Question


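I've been looking for a way to make a foldl that works on infinite lists, for situations where guarded recursion isn't available but where, depending on the first argument, the second argument may not be needed.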

Take multiplication: it normally needs both arguments, so guarded recursion doesn't work, but if the first argument is 0 it can short-circuit.

So I wrote the following function:

foldlp :: (b -> a -> b) -> (b -> Bool) -> b -> [a] -> b
foldlp f p = go where
    go b [] = b
    go b (x : xs) 
        | p b = go (f b x) xs
        | otherwise = b


And I tested it with my custom short-circuiting multiplication function:

 mult :: Integer -> Integer -> Integer
 mult 0 _ = 0
 mult x y = x * y

 main :: IO ()
 main = print . <test_function>
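
To make the short-circuiting concrete, here is a minimal sketch that runs foldlp with mult on an infinite list. foldlp and mult are copied from the question; the input list infiniteWithZero is a hypothetical example and is not one of the benchmarked cases.

```haskell
module Main where

-- foldlp and mult are copied unchanged from the question.
foldlp :: (b -> a -> b) -> (b -> Bool) -> b -> [a] -> b
foldlp f p = go where
    go b [] = b
    go b (x : xs)
        | p b = go (f b x) xs
        | otherwise = b

mult :: Integer -> Integer -> Integer
mult 0 _ = 0
mult x y = x * y

-- Hypothetical input: a zero followed by an infinite tail.
infiniteWithZero :: [Integer]
infiniteWithZero = [1, 2, 3, 0] ++ repeat 5

main :: IO ()
main = do
    -- The fold stops once the accumulator becomes 0,
    -- so the infinite tail is not traversed.
    print (foldlp mult (/= 0) 1 infiniteWithZero)
    -- On a finite list without zeros it acts like an ordinary left fold.
    print (foldlp mult (/= 0) 1 [1 .. 5])
```

This should print 0 and then 120.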


The results I got with -prof -fprof-auto -O2 and +RTS -p were:

foldlp mult (/= 0) 1 $ replicate (10 ^ 7) 1
total time = 0.40 secs
total alloc = 480,049,336 bytes

foldlp mult (`seq` True) 1 $ replicate (10 ^ 7) 1
total time = 0.37 secs
total alloc = 480,049,336 bytes

foldl' mult 1 $ replicate (10 ^ 7) 1
total time = 0.37 secs
total alloc = 480,049,352 bytes

foldl mult 1 $ replicate (10 ^ 7) 1
total time = 0.74 secs
total alloc = 880,049,352 bytes

foldr mult 1 $ replicate (10 ^ 7) 1
total time = 0.87 secs
total alloc = 880,049,336 bytes


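This is very promising, because my custom function allows flexible kinds of strictness and also works on infinite lists.

The first example terminates as soon as it hits a 0, as foldr would, but foldr is much slower.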

It also avoids problems such as thunks building up inside a tuple like ((1 + 2) + 3, (10 + 20) + 30), which is technically in WHNF and so defeats foldl'.
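
As an illustration of the tuple case, the sketch below uses the predicate to force both components of a pair accumulator on every step, which foldl' alone would not do because it only evaluates the pair to WHNF. The names sumAndCount and forceBoth are hypothetical and do not appear in the question.

```haskell
module Main where

-- foldlp is copied unchanged from the question.
foldlp :: (b -> a -> b) -> (b -> Bool) -> b -> [a] -> b
foldlp f p = go where
    go b [] = b
    go b (x : xs)
        | p b = go (f b x) xs
        | otherwise = b

-- Hypothetical step function: running sum and element count in a pair.
sumAndCount :: (Integer, Integer) -> Integer -> (Integer, Integer)
sumAndCount (s, c) x = (s + x, c + 1)

-- Hypothetical predicate: never stops the fold, but forces both
-- components of the accumulator before each step.
forceBoth :: (Integer, Integer) -> Bool
forceBoth (s, c) = s `seq` c `seq` True

main :: IO ()
main = print (foldlp sumAndCount forceBoth (0, 0) [1 .. 10])
```

This should print (55,10).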

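You can recover foldl with flip foldlp (const True) and foldl' with flip foldlp (`seq` True). Doing so seems to recover the performance characteristics of the original, more restricted functions.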
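
Written out, the two specialisations might look like the sketch below. The names foldlViaP and foldlViaP' are hypothetical; the comparison against foldl and foldl' only checks the results on one small input, not strictness or performance.

```haskell
module Main where

import Data.List (foldl')

-- foldlp is copied unchanged from the question.
foldlp :: (b -> a -> b) -> (b -> Bool) -> b -> [a] -> b
foldlp f p = go where
    go b [] = b
    go b (x : xs)
        | p b = go (f b x) xs
        | otherwise = b

-- Hypothetical name: the predicate never inspects the accumulator,
-- so this is intended to behave like the lazy foldl.
foldlViaP :: (b -> a -> b) -> b -> [a] -> b
foldlViaP = flip foldlp (const True)

-- Hypothetical name: the predicate forces the accumulator to WHNF
-- on each step, so this is intended to behave like foldl'.
foldlViaP' :: (b -> a -> b) -> b -> [a] -> b
foldlViaP' = flip foldlp (`seq` True)

main :: IO ()
main = do
    print (foldlViaP (+) 0 [1 .. 100 :: Int] == foldl (+) 0 [1 .. 100])
    print (foldlViaP' (+) 0 [1 .. 100 :: Int] == foldl' (+) 0 [1 .. 100])
```

Both lines should print True.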

So, as a side note, I think foldlp would be a worthwhile addition to Foldable.

But my actual question is why performance drops significantly when I add {-# INLINE foldlp #-} to the function, giving me:

foldlp mult (/= 0) 1 $ replicate (10 ^ 7) 1
total time = 0.67 secs
total alloc = 800,049,336 bytes


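So my real question is why this happens. I thought the downside of inlining was code bloat, not a significant hit to runtime performance and increased memory usage.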

Answer 1

Chr*_*amm 5

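According to the GHC documentation, the INLINE pragma prevents other compiler optimizations from being applied to the function, so that rewrite rules can still take effect.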

So my guess is that by using INLINE you removed some optimization that GHC would otherwise apply to make your code faster.

After some digging (compiling with -ddump-simpl), I found the optimization that GHC performs. To do so, I looked at the Core for foldlp with and without inlining:

Inlined:

foldlp =
  \ (@ b_a10N)
    (@ a_a10O)
    (eta_B2 :: b_a10N -> a_a10O -> b_a10N)
    (eta1_B1 :: b_a10N -> Bool)
    (eta2_X3 :: b_a10N)
    (eta3_X5 :: [a_a10O]) ->
    letrec {
      go_s1Ao [Occ=LoopBreaker] :: b_a10N -> [a_a10O] -> b_a10N
      [LclId, Arity=2, Str=DmdType <L,U><S,1*U>]
      go_s1Ao =
        \ (b1_avT :: b_a10N) (ds_d1xQ :: [a_a10O]) ->
        -- Removed the actual definition of go for brevity,
        -- it's the same in both cases
          }; } in
    go_s1Ao eta2_X3 eta3_X5


Not inlined:

foldlp =
  \ (@ b_a10N)
    (@ a_a10O)
    (f_avQ :: b_a10N -> a_a10O -> b_a10N)
    (p_avR :: b_a10N -> Bool) ->
    letrec {
      go_s1Am [Occ=LoopBreaker] :: b_a10N -> [a_a10O] -> b_a10N
      [LclId, Arity=2, Str=DmdType <L,U><S,1*U>]
      go_s1Am =
        \ (b1_avT :: b_a10N) (ds_d1xQ :: [a_a10O]) ->
        -- Removed the actual definition of go for brevity,
        -- it's the same in both cases
          }; } in
    go_s1Am


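The relevant difference is in the very last line. The optimizer removes the step of actually having to call go from within foldlp: it simply turns foldlp into a function of two arguments that returns a function of two arguments. With INLINE, this optimization is not performed, and the Core looks exactly like the code you wrote.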
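
A source-level rendering of the two Core shapes may make the difference easier to see. This is only a sketch: the names foldlpArity2 and foldlpArity4 are hypothetical, and GHC is free to compile both definitions to the same Core, so it illustrates the shapes rather than reproducing the timing difference.

```haskell
module Main where

-- Shape of the non-inlined Core: two arguments, and the result is the
-- local loop go itself (itself a function of two arguments).
foldlpArity2 :: (b -> a -> b) -> (b -> Bool) -> b -> [a] -> b
foldlpArity2 f p = go where
    go b [] = b
    go b (x : xs)
        | p b = go (f b x) xs
        | otherwise = b

-- Shape of the INLINE Core: four arguments, followed by an explicit
-- call to the loop with the last two.
foldlpArity4 :: (b -> a -> b) -> (b -> Bool) -> b -> [a] -> b
foldlpArity4 f p b0 xs0 = go b0 xs0 where
    go b [] = b
    go b (x : xs)
        | p b = go (f b x) xs
        | otherwise = b

main :: IO ()
main = do
    -- Both variants compute the same result; only their arity differs.
    print (foldlpArity2 (+) (< 10) 0 [1 ..])
    print (foldlpArity4 (+) (< 10) 0 [1 ..])
```

Both calls should print 10, since the running sum 0, 1, 3, 6, 10 first fails the (< 10) test at 10.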

I verified this by writing three variants of foldlp:

module Main where

foldlp :: (b -> a -> b) -> (b -> Bool) -> b -> [a] -> b
foldlp f p = go where
      go b [] = b
      go b (x : xs)
        | p b = go (f b x) xs
        | otherwise = b

{-# INLINE foldlpInline #-}
foldlpInline :: (b -> a -> b) -> (b -> Bool) -> b -> [a] -> b
foldlpInline f p = go where
      go b [] = b
      go b (x : xs)
        | p b = go (f b x) xs
        | otherwise = b


{-# INLINE foldlp' #-} -- So that the code is not optimized
foldlp' b [] = b
foldlp' b (x : xs)
        | (/= 0) b = foldlp' (mult b x) xs
        | otherwise = b

mult :: Integer -> Integer -> Integer
mult 0 _ = 0
mult x y = x * y

--main = print $ foldlp mult (/= 0) 1 $ replicate (10 ^ 7) 1
--main = print $ foldlpInline mult (/= 0) 1 $ replicate (10 ^ 7) 1
main = print $ foldlp' 1 $ replicate (10 ^ 7) 1


The results are:

First case (normal, not inlined):

./test  0,42s user 0,01s system 96% cpu 0,446 total


Second case (inlined):

./test  0,83s user 0,02s system 98% cpu 0,862 total


Third case (what the compiler produces for the non-inlined version):

./test  0,42s user 0,01s system 99% cpu 0,432 total

