Matching ByteStrings in Parsec

Ari*_*ira 4 haskell parsec bytestring

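I am currently trying to use the Full CSV Parser presented in Real World Haskell. To do so I tried to modify the code to use ByteString instead of String, but it relies on the string combinator, which only works with String.

Is there a Parsec combinator similar to string that works with ByteString, without having to convert back and forth?

I have seen that there is an alternative parser that handles ByteString, attoparsec, but I would prefer to stick with Parsec, since I am only just learning how to use it.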

Bee*_*tle 5

I am assuming you are starting with something like this:

import Prelude hiding (getContents, putStrLn)
import Data.ByteString
import Text.Parsec.ByteString

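Here is what I have so far. There are two versions, and both compile. Probably neither is exactly what you want, but they should aid discussion and help you clarify your question.

Something I noticed along the way: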

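  • If you import Text.Parsec.ByteString, it uses uncons from Data.ByteString.Char8, which in turn uses w2c from Data.ByteString.Internal, to convert every byte it reads into a Char. This is what lets Parsec's line and column error reporting work sensibly, and it also lets you use string and friends without any problem.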
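
To illustrate the point above, here is a minimal sketch in which Parsec's ordinary string combinator runs directly on strict ByteString input, with no manual conversion. The names hello and matchHello are hypothetical and exist only for this example; it assumes the parsec and bytestring packages are available.

```haskell
import qualified Data.ByteString.Char8 as C8
import Text.Parsec (ParseError, parse, string)
import Text.Parsec.ByteString (Parser)

-- Parsec's stock `string` combinator, used at the ByteString-backed
-- Parser type. The result is still a String, because the stream
-- yields Char tokens.
hello :: Parser String
hello = string "hello"

-- Hypothetical helper: run the parser on strict ByteString input.
matchHello :: C8.ByteString -> Either ParseError String
matchHello = parse hello "(example)"
```

With these definitions, matchHello (C8.pack "hello, world") evaluates to Right "hello", while input that does not start with hello yields a Left carrying a ParseError with line and column information.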

So the easy version of the CSV parser does exactly that:

import Prelude hiding (getContents, putStrLn)
import Data.ByteString (ByteString)

import qualified Prelude (getContents, putStrLn)
import qualified Data.ByteString as ByteString (getContents)

import Text.Parsec
import Text.Parsec.ByteString

csvFile :: Parser [[String]]
csvFile = endBy line eol
line :: Parser [String]
line = sepBy cell (char ',')
cell :: Parser String
cell = quotedCell <|> many (noneOf ",\n\r")

quotedCell :: Parser String
quotedCell = 
    do _ <- char '"'
       content <- many quotedChar
       _ <- char '"' <?> "quote at end of cell"
       return content

quotedChar :: Parser Char
quotedChar =
        noneOf "\""
    <|> try (string "\"\"" >> return '"')

eol :: Parser String
eol =   try (string "\n\r")
    <|> try (string "\r\n")
    <|> string "\n"
    <|> string "\r"
    <?> "end of line"

parseCSV :: ByteString -> Either ParseError [[String]]
parseCSV = parse csvFile "(unknown)"

main :: IO ()
main =
    do c <- ByteString.getContents
       case parse csvFile "(stdin)" c of
            Left e -> do Prelude.putStrLn "Error parsing input:"
                         print e
            Right r -> mapM_ print r

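But that was so trivial to get working that I assume it cannot possibly be what you want. Perhaps you want everything to stay a ByteString, or [Word8], or something similar all the way through? Hence my second attempt below. I am still importing Text.Parsec.ByteString, which might be a mistake, and the code is hopelessly riddled with conversions.

Still, it compiles and has complete type annotations, so it should make a reasonable starting point.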

import Prelude hiding (getContents, putStrLn)
import Data.ByteString (ByteString)
import Control.Monad (liftM)

import qualified Prelude (getContents, putStrLn)
import qualified Data.ByteString as ByteString (pack, getContents)
import qualified Data.ByteString.Char8 as Char8 (pack)

import Data.Word (Word8)
import Data.ByteString.Internal (c2w)

import Text.Parsec ((<|>), (<?>), parse, try, endBy, sepBy, many)
import Text.Parsec.ByteString
import Text.Parsec.Prim (tokens, tokenPrim)
import Text.Parsec.Pos (updatePosChar, updatePosString)
import Text.Parsec.Error (ParseError)

csvFile :: Parser [[ByteString]]
csvFile = endBy line eol
line :: Parser [ByteString]
line = sepBy cell (char ',')
cell :: Parser ByteString
cell = quotedCell <|> liftM ByteString.pack (many (noneOf ",\n\r"))

quotedCell :: Parser ByteString
quotedCell = 
    do _ <- char '"'
       content <- many quotedChar
       _ <- char '"' <?> "quote at end of cell"
       return (ByteString.pack content)

quotedChar :: Parser Word8
quotedChar =
        noneOf "\""
    <|> try (string "\"\"" >> return (c2w '"'))

eol :: Parser ByteString
eol =   try (string "\n\r")
    <|> try (string "\r\n")
    <|> string "\n"
    <|> string "\r"
    <?> "end of line"

parseCSV :: ByteString -> Either ParseError [[ByteString]]
parseCSV = parse csvFile "(unknown)"

main :: IO ()
main =
    do c <- ByteString.getContents
       case parse csvFile "(stdin)" c of
            Left e -> do Prelude.putStrLn "Error parsing input:"
                         print e
            Right r -> mapM_ print r

-- replacements for some of the functions in the Parsec library

noneOf :: String -> Parser Word8
noneOf cs   = satisfy (\b -> b `notElem` [c2w c | c <- cs])

char :: Char -> Parser Word8
char c      = byte (c2w c)

byte :: Word8 -> Parser Word8
byte c      = satisfy (==c)  <?> show [c]

satisfy :: (Word8 -> Bool) -> Parser Word8
satisfy f   = tokenPrim (\c -> show [c])
                        (\pos c _cs -> updatePosChar pos c)
                        (\c -> if f (c2w c) then Just (c2w c) else Nothing)

string :: String -> Parser ByteString
string s    = liftM Char8.pack (tokens show updatePosString s)

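From an efficiency standpoint, your concern should perhaps be the two ByteString.pack calls in the definitions of cell and quotedCell. You could try to replace the Text.Parsec.ByteString module so that, instead of "making strict ByteStrings an instance of Stream with Char token type", you make ByteString an instance of Stream with Word8 token type. That will not help with efficiency, though; it will only give you a headache as you try to reimplement all the sourcePos functions needed to track your position in the input for error messages.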

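No, the way to make it more efficient is to change the types of cell, quotedCell and string to Parser [Word8], and the types of line and csvFile to Parser [[Word8]] and Parser [[[Word8]]] respectively. You could even change the type of eol to Parser (). The necessary changes look like this: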

cell :: Parser [Word8]
cell = quotedCell <|> many (noneOf ",\n\r")

quotedCell :: Parser [Word8]
quotedCell = 
    do _ <- char '"'
       content <- many quotedChar
       _ <- char '"' <?> "quote at end of cell"
       return content

string :: String -> Parser [Word8]
-- tokens yields a Parser String, so map c2w over its result inside the parser
string s    = liftM (map c2w) (tokens show updatePosString s)

As far as efficiency goes, you need not worry about all the calls to c2w, because they cost nothing.
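
As a small sketch of what those calls do, the snippet below maps c2w and w2c over a list. The helper names toBytes and fromBytes are hypothetical. One caveat worth keeping in mind: as far as I know, c2w simply truncates any Char above '\255' to its low 8 bits, so this view is only faithful for ASCII or Latin-1 text.

```haskell
import Data.ByteString.Internal (c2w, w2c)
import Data.Word (Word8)

-- Hypothetical helper: view an ASCII String as raw bytes.
toBytes :: String -> [Word8]
toBytes = map c2w

-- Hypothetical helper: view raw bytes as Chars again.
fromBytes :: [Word8] -> String
fromBytes = map w2c
```

For example, toBytes "a,b" evaluates to [97,44,98], and fromBytes (toBytes "a,b") gives back "a,b".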

If this does not answer your question, please say what would.