为什么我的函数的pointfree版本使用更多的内存

sem*_*lon 16 haskell pointfree

我正在研究Project Euler问题并最终得到一个Haskell文件,其中包含一个如下所示的函数:

matches :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
matches f cs = foldr (\(cs', n) a -> fromBool (f cs cs') * n + a) 0
Run Code Online (Sandbox Code Playgroud)

随着fromBool进口Foreign.Marshal.Utils只是为了快速转换True1False0.

我试图从我的解决方案中获得更多的速度,所以我尝试切换foldrfoldl'(在过程中切换参数),因为我认为foldr在数字上使用没有多大意义.

根据GHC的分析器,从切换foldrfoldl'使我分配的内存超过两倍.

为了好玩,我还决定用函数的pointfree版本替换lambda:

matches :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
matches f cs = foldr ((+) . uncurry ((*) . fromBool . f cs)) 0
Run Code Online (Sandbox Code Playgroud)

这导致我的内存分配从foldr版本增加了20倍.

现在这不是一个大问题,因为即使在20x情况下总内存分配只是大约,135Mb并且程序的运行时相对不受影响,如果有更高内存分配版本运行得稍快.

但我真的很好奇这些结果是如何可能的,以便将来我能够选择"正确"的功能,因为我没有那么多的余地.

编辑:

GHC版本7.10.2,编译用-O2 -prof -fprof-auto.执行+RTS -p.

编辑2:

好吧,看起来这很难重现,省略其余的代码,这里是整个程序:

下面的掠夺者:

{-# LANGUAGE NoMonomorphismRestriction #-}

import Control.Monad
import Data.List
import Foreign.Marshal.Utils

data Color = Red | Green | Blue deriving (Eq, Enum, Bounded, Show)

colors :: [Color]
colors = [Red ..]

matches :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
matches f x = foldr ((+) . uncurry ((*) . fromBool . f x)) 0
-- matches f x = foldr (\(y, n) a -> fromBool (f x y) * n + a) 0
-- matches f x = foldl' (\a (y, n) -> fromBool (f x y) * n + a) 0

invert :: [([Color], Int)] -> [([Color], Int)]
invert rs = (\cs -> (cs, matches valid cs rs)) <$> choices
  where
    len = maximum $ length . fst <$> rs
    choices = replicateM len colors
    valid (x : xs) (y : ys) = x /= y && valid xs ys
    valid _ _ = True

expand :: [([Color], Int)] -> [([Color], Int)]
expand rs = (\cs -> (cs, matches valid cs rs)) <$> choices
  where
    len = maximum $ length . fst <$> rs
    choices = replicateM (len + 1) colors
    valid (x1 : x2 : xs) (y : ys) = x1 /= y && x2 /= y && valid (x2 : xs) ys
    valid _ _ = True

getRow :: Int -> [([Color], Int)]
getRow 1 = flip (,) 1 . pure <$> colors
getRow n = expand . invert $ getRow (n - 1)

result :: Int -> Int
result n = sum $ snd <$> getRow n

main :: IO ()
main = print $ result 8
Run Code Online (Sandbox Code Playgroud)

Zet*_*eta 13

注意:这篇文章是用文字Haskell编写的.将其复制到一个文件中,保存为*.lhs,然后在GHC(i)中编译/加载.此外,在您编辑代码之前,我开始编写此答案,但课程保持不变.

TL; DR

这个Prelude函数uncurry太懒了,而你的模式匹配就足够严格了.

谨慎和免责声明

我们正在进入一个神奇而奇怪的地方.谨防.此外,我的核心能力是初步的.现在我失去了所有的信誉,让我们开始吧.

经过测试的代码

为了知道我们在哪里获得额外的内存要求,拥有两个以上的函数是很有用的.

> import Control.Monad (forM_)
Run Code Online (Sandbox Code Playgroud)

这是您原始的,无点免费变体:

> matches :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
> matches    f cs = foldr (\(cs', n) a -> fromEnum (f cs cs') * n + a) 0
Run Code Online (Sandbox Code Playgroud)

这是一个仅略微无点的变体,参数a是eta减少的.

> matchesPF' :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
> matchesPF' f cs = foldr (\(cs', n) -> (+) (fromEnum (f cs cs') * n)) 0
Run Code Online (Sandbox Code Playgroud)

这是一种uncurry手工内联的变体.

> matchesPFI :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
> matchesPFI f cs = foldr ((+) . (\(cs', n) -> fromEnum (f cs cs') * n)) 0
Run Code Online (Sandbox Code Playgroud)

这是你的免费版本.

> matchesPF :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
> matchesPF  f cs = foldr ((+) . uncurry  ((*) . fromEnum . f cs)) 0
Run Code Online (Sandbox Code Playgroud)

这是使用自定义的变体uncurry,请参见下文.

> matchesPFU :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
> matchesPFU f cs = foldr ((+) . uncurryI ((*) . fromEnum . f cs)) 0
Run Code Online (Sandbox Code Playgroud)

这是一个使用自定义懒惰的变体uncurry,见下文.

> matchesPFL :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
> matchesPFL f cs = foldr ((+) . uncurryL ((*) . fromEnum . f cs)) 0
Run Code Online (Sandbox Code Playgroud)

为了轻松测试功能,我们使用一个列表:

> funcs = [matches, matchesPF', matchesPF, matchesPFL, matchesPFU, matchesPFI]
Run Code Online (Sandbox Code Playgroud)

我们自编uncurry:

> uncurryI :: (a -> b -> c) -> (a, b) -> c
> uncurryI f (a,b) = f a b
Run Code Online (Sandbox Code Playgroud)

一个懒人uncurry:

> uncurryL :: (a -> b -> c) -> (a, b) -> c
> uncurryL f p = f (fst p) (snd p)
Run Code Online (Sandbox Code Playgroud)

惰性变体uncurryL具有与变体相同的语义Prelude,例如

uncurry (\_ _ -> 0) undefined == 0 == uncurryL (\_ _ -> 0) undefined
Run Code Online (Sandbox Code Playgroud)

uncurryI在对的脊椎中是严格的.

> main = do
>   let f a b = a < b
>   forM_ [1..10] $ \i ->
>     forM_ funcs $ \m ->
>       print $ m f i (zip (cycle [1..10]) [1..i*100000])
Run Code Online (Sandbox Code Playgroud)

该列表有意[1..i*100000]依赖i,因此我们不会引入CAF并扭曲我们的分配配置文件.

脱毒的代码

在我们深入研究配置文件之前,让我们看看每个函数的desugared代码:

==================== Desugar (after optimization) ====================
Result size of Desugar (after optimization)
  = {terms: 221, types: 419, coercions: 0}

uncurryL
uncurryL = \ @ a @ b @ c f p -> f (fst p) (snd p)

uncurryI
uncurryI = \ @ a @ b @ c f ds -> case ds of _ { (a, b) -> f a b }

-- uncurried inlined by hand
matchesPFI =
  \ @ a f cs ->
    foldr
      $fFoldable[]
      (. (+ $fNumInt)
         (\ ds ->
            case ds of _ { (cs', n) ->
            * $fNumInt (fromEnum $fEnumBool (f cs cs')) n
            }))
      (I# 0)

-- lazy uncurry
matchesPFL =
  \ @ a f cs ->
    foldr
      $fFoldable[]
      (. (+ $fNumInt)
         (uncurryL (. (* $fNumInt) (. (fromEnum $fEnumBool) (f cs)))))
      (I# 0)

-- stricter uncurry
matchesPFU =
  \ @ a f cs ->
    foldr
      $fFoldable[]
      (. (+ $fNumInt)
         (uncurryI (. (* $fNumInt) (. (fromEnum $fEnumBool) (f cs)))))
      (I# 0)

-- normal uncurry
matchesPF =
  \ @ a f cs ->
    foldr
      $fFoldable[]
      (. (+ $fNumInt)
         (uncurry (. (* $fNumInt) (. (fromEnum $fEnumBool) (f cs)))))
      (I# 0)

-- eta-reduced a
matchesPF' =
  \ @ a f cs ->
    foldr
      $fFoldable[]
      (\ ds ->
         case ds of _ { (cs', n) ->
         + $fNumInt (* $fNumInt (fromEnum $fEnumBool (f cs cs')) n)
         })
      (I# 0)

-- non-point-free
matches =
  \ @ a f cs ->
    foldr
      $fFoldable[]
      (\ ds a ->
         case ds of _ { (cs', n) ->
         + $fNumInt (* $fNumInt (fromEnum $fEnumBool (f cs cs')) n) a
         })
      (I# 0)
Run Code Online (Sandbox Code Playgroud)

到目前为止,一切似乎都很好.没什么可惊讶的.类型类函数被替换为它们的字典变体,例如foldr变成foldr $fFoldable[]`, since we call it on a list.

The profile

   Mon Jul 18 15:47 2016 Time and Allocation Profiling Report  (Final)

       Prof +RTS -s -p -RTS

    total time  =        1.45 secs   (1446 ticks @ 1000 us, 1 processor)
    total alloc = 1,144,197,200 bytes  (excludes profiling overheads)

COST CENTRE  MODULE    %time %alloc

matchesPF'   Main       13.6    0.0
matchesPF    Main       13.3   11.5
main.\.\     Main       11.8   76.9
main.f       Main       10.9    0.0
uncurryL     Main        9.5   11.5
matchesPFU   Main        8.9    0.0
matchesPFI   Main        7.3    0.0
matches      Main        6.9    0.0
matchesPFL   Main        6.3    0.0
uncurryI     Main        5.3    0.0
matchesPF'.\ Main        2.6    0.0
matchesPFI.\ Main        2.0    0.0
matches.\    Main        1.5    0.0


                                                             individual     inherited
COST CENTRE        MODULE                  no.     entries  %time %alloc   %time %alloc

MAIN               MAIN                     44           0    0.0    0.0   100.0  100.0
 main              Main                     89           0    0.0    0.0   100.0  100.0
  main.\           Main                     90          10    0.0    0.0   100.0  100.0
   main.\.\        Main                     92          60   11.8   76.9   100.0  100.0
    funcs          Main                     93           0    0.0    0.0    88.2   23.1
     matchesPFI    Main                    110          10    7.3    0.0    11.7    0.0
      matchesPFI.\ Main                    111     5500000    2.0    0.0     4.4    0.0
       main.f      Main                    112     5500000    2.4    0.0     2.4    0.0
     matchesPFU    Main                    107          10    8.9    0.0    15.3    0.0
      uncurryI     Main                    108     5500000    5.3    0.0     6.4    0.0
       main.f      Main                    109     5500000    1.1    0.0     1.1    0.0
     matchesPFL    Main                    104          10    6.3    0.0    17.7   11.5
      uncurryL     Main                    105     5500000    9.5   11.5    11.4   11.5
       main.f      Main                    106     5500000    1.9    0.0     1.9    0.0
     matchesPF     Main                    102          10   13.3   11.5    15.4   11.5
      main.f       Main                    103     5500000    2.1    0.0     2.1    0.0
     matchesPF'    Main                     99          10   13.6    0.0    17.2    0.0
      matchesPF'.\ Main                    100     5500000    2.6    0.0     3.6    0.0
       main.f      Main                    101     5500000    1.0    0.0     1.0    0.0
     matches       Main                     94          10    6.9    0.0    10.9    0.0
      matches.\    Main                     97     5500000    1.5    0.0     4.0    0.0
       main.f      Main                     98     5500000    2.5    0.0     2.5    0.0
 CAF               Main                     87           0    0.0    0.0     0.0    0.0
  funcs            Main                     91           1    0.0    0.0     0.0    0.0
  main             Main                     88           1    0.0    0.0     0.0    0.0
   main.\          Main                     95           0    0.0    0.0     0.0    0.0
    main.\.\       Main                     96           0    0.0    0.0     0.0    0.0
 CAF               GHC.IO.Handle.FD         84           0    0.0    0.0     0.0    0.0
 CAF               GHC.Conc.Signal          78           0    0.0    0.0     0.0    0.0
 CAF               GHC.IO.Encoding          76           0    0.0    0.0     0.0    0.0
 CAF               GHC.IO.Handle.Text       75           0    0.0    0.0     0.0    0.0
 CAF               GHC.IO.Encoding.Iconv    59           0    0.0    0.0     0.0    0.0

Ignore the main\.\.噪音,它只是列表.但是,有一点应该立即注意到:matchesPFuncurryL使用相同的alloc%:

matchesPF    Main       13.3   11.5
uncurryL     Main        9.5   11.5
Run Code Online (Sandbox Code Playgroud)

进入核心

现在是时候检查生成的CORE(ghc -ddump-simpl)了.我们会注意到大多数函数已经转换为worker包装器,它们看起来或多或少相同(-dsuppress-all -dsuppress-uniques):

$wa5
$wa5 =
  \ @ a1 w w1 w2 ->
    letrec {
      $wgo
      $wgo =
        \ w3 ->
          case w3 of _ {
            [] -> 0;
            : y ys ->
              case y of _ { (cs', n) ->
              case $wgo ys of ww { __DEFAULT ->
              case w w1 cs' of _ {
                False -> case n of _ { I# y1 -> ww };
                True -> case n of _ { I# y1 -> +# y1 ww }
              }
              }
              }
          }; } in
    $wgo w2
Run Code Online (Sandbox Code Playgroud)

这是你通常的工人包装工.$wgo获取一个列表,检查它是否为空,head(case y of _ { (cs', n) ->…)中是严格的,还是递归结果中的惰性$wgo ys of ww.

所有功能看起来都一样.好吧,除了matchesPF(你的变种)以外

-- matchesPF
$wa3 =
  \ @ a1 w w1 w2 ->
    letrec {
      $wgo =
        \ w3 ->
          case w3 of _ {
            [] -> 0;
            : y ys ->
              case $wgo ys of ww { __DEFAULT ->
              case let {
                     x = case y of _ { (x1, ds) -> x1 } } in
                   case w w1 x of _ {
                     False ->
                       case y of _ { (ds, y1) -> case y1 of _ { I# y2 -> main13 } };
                              -- main13 is just #I 0
                     True -> case y of _ { (ds, y1) -> y1 }
                   }
              of _ { I# x ->
              +# x ww
              }
              }
          }; } in
    $wgo w2
Run Code Online (Sandbox Code Playgroud)

matchesPFL(使用懒惰的变种uncurryL)

-- matchesPFL
$wa2
$wa2 =
  \ @ a1 w w1 w2 ->
    letrec {
      $wgo =
        \ w3 ->
          case w3 of _ {
            [] -> 0;
            : y ys ->
              case $wgo ys of ww { __DEFAULT ->
              case snd y of ww1 { I# ww2 ->
              case let {
                     x = fst y } in
                   case w w1 x of _ {
                     False -> main13;
                     True -> ww1
                   }
              of _ { I# x ->
              +# x ww
              }
              }
              }
          }; } in
    $wgo w2
Run Code Online (Sandbox Code Playgroud)

它们几乎是一样的.它们都包含let绑定.这将产生一个thunk,通常会导致更糟的空间需求.

解决方案

我认为此时的罪魁祸首很明显.是的uncurry.GHC希望强制执行正确的语义

uncurry (const (const 0)) undefined
Run Code Online (Sandbox Code Playgroud)

然而,这增加了懒惰和额外的thunk.您的非pointfree变体不会引入该行为,因为您在对上进行模式匹配:

foldr (\(cs', n) a -> …)
Run Code Online (Sandbox Code Playgroud)

还是不相信我?使用惰性模式匹配

foldr (\ ~(cs', n) a -> …)
Run Code Online (Sandbox Code Playgroud)

你会发现它的matches行为与...相同matchesPF.所以使用稍微严格的变体uncurry.uncurryI足以给严格分析器一个提示.

请注意,对这种行为是众所周知的.RWH花了整整一章试图优化单个函数的行为,其中中间对导致问题.