Haskell: Scan through a list and apply a different function to each element

Question

Xan*_*unn 6 haskell functional-programming list

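I need to scan through a document and accumulate the output of different functions for each string in the file. Which function is run on any given line of the file depends on what is in that line.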

I could do this very inefficiently by making a complete pass through the file for every list I want to collect. Example pseudocode:

at :: B.ByteString -> Maybe Atom
at line
    | line == ATOM record = do stuff to return Just Atom
    | otherwise = Nothing

ot :: B.ByteString -> Maybe Sheet
ot line
    | line == SHEET record = do other stuff to return Just Sheet
    | otherwise = Nothing

I would then map these functions over the entire list of lines in the file to get a complete list of Atoms and Sheets:

mapper :: [B.ByteString] -> IO ()
mapper lines = do
    let atoms = mapMaybe at lines
    let sheets = mapMaybe ot lines
    -- Do stuff with my atoms and sheets

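However, this is inefficient because I am mapping over the entire list of strings once for every list I want to create. Instead, I want to map over the list of line strings only once, identify each line as I move through it, apply the appropriate function, and store the results in different lists.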

My C mentality wants to do this (pseudocode):

mapper' :: [B.ByteString] -> IO ()
mapper' lines = do
    let atoms = []
    let sheets = []
    for line in lines:
        | line == ATOM record = (atoms = atoms ++ at line)
        | line == SHEET record = (sheets = sheets ++ ot line)
    -- Now 'atoms' is a complete list of all the ATOM records
    --  and 'sheets' is a complete list of all the SHEET records

What is the Haskell way of doing this? I simply can't get my functional-programming mindset to come up with a solution.

Answer 1

Joh*_*n L 10

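First of all, I think the answers others have supplied will work at least 95% of the time. It is always good practice to code for the problem at hand by using appropriate data types (or tuples in some cases). However, sometimes you really don't know in advance what you're looking for in the list, and in those cases trying to enumerate all the possibilities is difficult, time-consuming, and error-prone. Or you find yourself writing multiple variants of the same sort of thing (manually inlining several folds into one) and you'd like to capture the abstraction.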

Fortunately, there are a few techniques that can help.

The framework solution

(a bit of self-promotion)

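First, the various "iteratee/enumerator" packages often provide functions for dealing with this type of problem. I'm most familiar with iteratee, which lets you do the following: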

import Data.Iteratee as I
import Data.Iteratee.Char
import Data.Maybe

-- first, you'll need some way to process the Atoms/Sheets/etc. you're getting
-- if you want to just return them as a list, you can use the built-in
-- stream2list function

-- next, create stream transformers
-- given at :: B.ByteString -> Maybe Atom
-- create a stream transformer from ByteString lines to Atoms
atIter :: Enumeratee [B.ByteString] [Atom] m a
atIter = I.mapChunks (catMaybes . map at)

otIter :: Enumeratee [B.ByteString] [Sheet] m a
otIter = I.mapChunks (catMaybes . map ot)

-- finally, combine multiple processors into one
-- if you have more than one processor, you can use zip3, zip4, etc.
procFile :: Iteratee [B.ByteString] m ([Atom],[Sheet])
procFile = I.zip (atIter =$ stream2list) (otIter =$ stream2list)

-- and run it on some data
runner :: FilePath -> IO ([Atom],[Sheet])
runner filename = do
  resultIter <- enumFile defaultBufSize filename $= enumLinesBS $ procFile
  run resultIter

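The benefit this gives you is extra composability. You can create stream transformers as you like and combine them with zip. If you like, you can even run the consumers in parallel (although only if you're working in the IO monad, and it is probably not worth it unless the consumers do a lot of work) by changing to this: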

import Data.Iteratee.Parallel

parProcFile = I.zip (parI $ atIter =$ stream2list) (parI $ otIter =$ stream2list)

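The result of doing this is not the same as a single for-loop; it still performs multiple traversals of the data. However, the traversal pattern has changed. This loads a certain amount of data at a time (defaultBufSize bytes) and traverses that chunk multiple times, storing partial results as necessary. After a chunk has been entirely consumed, the next chunk is loaded and the old one can be garbage collected.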

Hopefully this demonstrates the difference:

Data.List.zip:
  x1 x2 x3 .. x_n
                   x1 x2 x3 .. x_n

Data.Iteratee.zip:
  x1 x2      x3 x4      x_n-1 x_n
       x1 x2      x3 x4           x_n-1 x_n

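If you're doing enough work that parallelism makes sense, this isn't a problem at all. Because of memory locality, performance is much better than with multiple traversals over the whole input, as with Data.List.zip.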
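
The chunked traversal pattern can be imitated with plain lists, without the iteratee package. The following is only a rough sketch of the idea, not how iteratee is implemented: it uses String instead of B.ByteString to stay within base, and the Atom and Sheet types, the keyword-based at and ot parsers, and the chunksOf and processChunked helpers are all hypothetical stand-ins.

```haskell
import Data.Maybe (mapMaybe)

-- Hypothetical stand-ins for the question's Atom and Sheet records.
newtype Atom  = Atom String  deriving (Eq, Show)
newtype Sheet = Sheet String deriving (Eq, Show)

-- Hypothetical parsers: a line is recognised by its leading keyword.
at :: String -> Maybe Atom
at l = case words l of
  ("ATOM" : rest) -> Just (Atom (unwords rest))
  _               -> Nothing

ot :: String -> Maybe Sheet
ot l = case words l of
  ("SHEET" : rest) -> Just (Sheet (unwords rest))
  _                -> Nothing

-- Split a list into chunks of at most n elements (n is assumed positive).
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = let (h, t) = splitAt n xs in h : chunksOf n t

-- Each chunk is traversed twice (once per parser), but the two passes
-- stay close together, which is the pattern drawn in the diagram above.
-- The per-chunk pairs are merged with the Monoid instance for pairs of lists.
processChunked :: Int -> [String] -> ([Atom], [Sheet])
processChunked n ls =
  mconcat [ (mapMaybe at c, mapMaybe ot c) | c <- chunksOf n ls ]
```

For instance, processChunked 2 applied to a list of lines would run both parsers over two lines at a time before moving on to the next two.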

The beautiful solution

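If a single-traversal solution really makes the most sense, you might be interested in Max Rabkin's Beautiful Folding post and Conal Elliott's follow-up work (this too). The essential idea is that you can create data structures to represent folds and zips, and combining these lets you build a new, combined fold/zip function that needs only one traversal. It may be a little advanced for a Haskell beginner, but since you're already thinking about the problem you may find it interesting or useful. Max's post is probably the best starting point.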
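
To make the idea of "folds as data" concrete, here is a minimal sketch. It follows the general approach described above but is not taken from either post; the names Fold, Pair, runFold, collect, and evensAndOdds are assumptions chosen for illustration.

```haskell
{-# LANGUAGE ExistentialQuantification #-}
import Data.List (foldl')

-- A fold packaged as data: a step function, an initial state, and a
-- final projection. The state type s is hidden from the outside.
data Fold a b = forall s. Fold (s -> a -> s) s (s -> b)

-- Strict pair, so the combined state is forced at every step.
data Pair x y = Pair !x !y

instance Functor (Fold a) where
  fmap f (Fold step start done) = Fold step start (f . done)

-- Combining two folds pairs their states, so both advance in one pass.
instance Applicative (Fold a) where
  pure x = Fold (\() _ -> ()) () (const x)
  Fold stepF startF doneF <*> Fold stepX startX doneX =
    Fold step (Pair startF startX) done
    where
      step (Pair sf sx) a = Pair (stepF sf a) (stepX sx a)
      done (Pair sf sx)   = doneF sf (doneX sx)

-- Run a fold with a single strict left traversal of the list.
runFold :: Fold a b -> [a] -> b
runFold (Fold step start done) = done . foldl' step start

-- Collect the results of a Maybe-returning function (such as the
-- question's at or ot). Accumulates in reverse, then flips at the end.
collect :: (a -> Maybe b) -> Fold a [b]
collect f = Fold step [] reverse
  where step acc a = maybe acc (: acc) (f a)

-- Example: split numbers into evens and odds in a single traversal.
evensAndOdds :: [Int] -> ([Int], [Int])
evensAndOdds = runFold ((,) <$> collect keepEven <*> collect keepOdd)
  where
    keepEven n = if even n then Just n else Nothing
    keepOdd  n = if odd n  then Just n else Nothing
```

With the question's functions, the same shape would read runFold ((,) <$> collect at <*> collect ot), and a third kind of record only needs one more collect combined with (,,).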

Answer 2

dav*_*420 5

I show a solution for two types of line, but it is easily extended to five types of line by using a five-tuple instead of a two-tuple.

import Data.Monoid

eachLine :: B.ByteString -> ([Atom], [Sheet])
eachLine bs | isAnAtom bs = ([ {- calculate an Atom -} ], [])
            | isASheet bs = ([], [ {- calculate a Sheet -} ])
            | otherwise = error "eachLine"

allLines :: [B.ByteString] -> ([Atom], [Sheet])
allLines bss = mconcat (map eachLine bss)

The magic is done by mconcat from Data.Monoid (included with GHC).

(On a point of style: personally I would define a Line type, a parseLine :: B.ByteString -> Line function, and write eachLine bs = case parseLine bs of .... But this is peripheral to your question.)
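
The Line-type variant mentioned in the style note might look roughly like the sketch below. Everything about the record format is an assumption: the Atom and Sheet wrappers, the OtherLine constructor, and the rule that a record is identified by an "ATOM" or "SHEET" prefix are placeholders for whatever the real file format requires. Unlike the version above, unrecognised lines are skipped rather than raising an error.

```haskell
import qualified Data.ByteString.Char8 as B

-- Hypothetical record types standing in for the question's Atom and Sheet.
newtype Atom  = Atom B.ByteString  deriving (Eq, Show)
newtype Sheet = Sheet B.ByteString deriving (Eq, Show)

-- One constructor per kind of line, plus a catch-all for everything else.
data Line = AtomLine Atom | SheetLine Sheet | OtherLine
  deriving (Eq, Show)

-- Assumed format: the record type is a keyword prefix of the line.
parseLine :: B.ByteString -> Line
parseLine bs
  | B.pack "ATOM"  `B.isPrefixOf` bs = AtomLine  (Atom bs)
  | B.pack "SHEET" `B.isPrefixOf` bs = SheetLine (Sheet bs)
  | otherwise                        = OtherLine

-- Classify one line and place its result in the matching slot of the pair.
eachLine :: B.ByteString -> ([Atom], [Sheet])
eachLine bs = case parseLine bs of
  AtomLine a  -> ([a], [])
  SheetLine s -> ([], [s])
  OtherLine   -> ([], [])   -- skip unrecognised lines instead of calling error

-- mconcat uses the Monoid instance for pairs, which appends the two
-- lists component-wise.
allLines :: [B.ByteString] -> ([Atom], [Sheet])
allLines = mconcat . map eachLine
```

Keeping recognition in parseLine and accumulation in eachLine means a new kind of record needs one new constructor and one new case branch.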
