Parsec - error "combinator 'many' is applied to a parser that accepts an empty string"

stu*_*ith 7 haskell parsec

I am trying to write a parser with Parsec that will parse literate Haskell files such as the following:

The classic 'Hello, world' program.

\begin{code}

main = putStrLn "Hello, world"

\end{code}

More text.

I have written the following, inspired by the examples in RWH:

import Text.ParserCombinators.Parsec

main
    = do contents <- readFile "hello.lhs"
         let results = parseLiterate contents
         print results

data Element
    = Text String
    | Haskell String
    deriving (Show)


parseLiterate :: String -> Either ParseError [Element]

parseLiterate input
    = parse literateFile "(unknown)" input



literateFile
    = many codeOrProse

codeOrProse
    = code <|> prose

code
    = do eol
         string "\\begin{code}"
         eol
         content <- many anyChar
         eol
         string "\\end{code}"
         eol
         return $ Haskell content

prose
    = do content <- many anyChar
         return $ Text content

eol
    =   try (string "\n\r")
    <|> try (string "\r\n")
    <|> string "\n"
    <|> string "\r"
    <?> "end of line"

I hoped that this would result in something along the lines of:

[Text "The classic 'Hello, world' program.", Haskell "main = putStrLn \"Hello, world\"", Text "More text."]

(allowing for whitespace etc.).

This compiles fine, but when I run it, I get the error:

*** Exception: Text.ParserCombinators.Parsec.Prim.many: combinator 'many' is applied to a parser that accepts an empty string

Can anyone shed any light on this, and possibly help with a solution?

bzn*_*bzn 8

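As has already been pointed out, `many anyChar` is the problem. But not only in `prose`, also in `code`. The problem with `code` is that `content <- many anyChar` will consume everything: the newlines and the `\end{code}` tag as well.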
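To make the greediness concrete, here is a minimal sketch. The name `greedyBody` is made up for illustration; it mirrors the body of `code` from the question without the `eol` handling, and it assumes a Parsec 3 installation that provides `Text.Parsec`.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical parser with the same shape as the body of `code` in the question.
greedyBody :: Parser String
greedyBody = do
  content <- many anyChar        -- consumes everything up to end of input
  _ <- string "\\end{code}"      -- nothing is left for this to match
  return content

main :: IO ()
main = do
  let input = "main = putStrLn \"Hello, world\"\n\\end{code}\n"
  -- `many anyChar` on its own succeeds and returns the whole input, end tag included.
  print (either (const Nothing) Just (parse (many anyChar) "" input) == Just input)
  -- The full parser therefore fails: the end tag has already been consumed.
  print (either (const "parse error") id (parse greedyBody "" input))
```

Parsec does not backtrack into `many anyChar` to give characters back, so the later `string "\\end{code}"` only ever sees the end of the input.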

So you need some way to tell prose and code apart. An easy (but probably too naive) way to do so is to look for backslashes:

literateFile = many codeOrProse <* eof

code = do string "\\begin{code}"
          content <- many $ noneOf "\\"
          string "\\end{code}"
          return $ Haskell content

prose = do content <- many1 $ noneOf "\\"
           return $ Text content

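Now, you don't quite get the desired result, because the `Haskell` part will also contain newlines, but you can filter these out quite easily (given a function `filterNewlines` you could say `content <- filterNewlines <$> (many $ noneOf "\\")`).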
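A possible definition of `filterNewlines` is sketched below. The answer only names the function, so this body is an assumption: it simply drops every carriage return and line feed. `codeBody` is an illustrative name for the backslash-delimited `code` parser above, returning the raw string.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Assumed helper (not defined in the answer): drop all '\r' and '\n' characters.
filterNewlines :: String -> String
filterNewlines = filter (`notElem` "\r\n")

-- The backslash-delimited code parser with the filter applied to its content.
codeBody :: Parser String
codeBody = do
  _ <- string "\\begin{code}"
  content <- filterNewlines <$> many (noneOf "\\")
  _ <- string "\\end{code}"
  return content

main :: IO ()
main = print (either (const Nothing) Just (parse codeBody "" input))
  where
    input = "\\begin{code}\n\nmain = putStrLn \"Hello, world\"\n\n\\end{code}"
```

Note that this filter also joins the lines of a multi-line code block into one, so it only suits the single-line example from the question.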

EDIT

OK, I think I found a solution (it requires a recent Parsec version, because of `lookAhead`):

import Text.ParserCombinators.Parsec
import Control.Applicative hiding (many, (<|>))

main
    = do contents <- readFile "hello.lhs"
         let results = parseLiterate contents
         print results

data Element
    = Text String
    | Haskell String
    deriving (Show)    

parseLiterate :: String -> Either ParseError [Element]

parseLiterate input
    = parse literateFile "" input

literateFile
    = many codeOrProse

codeOrProse = code <|> prose

code = do string "\\begin{code}\n"
          c <- untilP (string "\\end{code}\n")
          string "\\end{code}\n"
          return $ Haskell c

prose = do t <- untilP $ (string "\\begin{code}\n") <|> (eof >> return "")
           return $ Text t

untilP p = do s <- many $ noneOf "\n"
              newline
              s' <- try (lookAhead p >> return "") <|> untilP p
              return $ s ++ s'

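`untilP p` parses a line, then checks whether the beginning of the next line can be successfully parsed by `p`. If so, it returns the empty string; otherwise it carries on. The `lookAhead` is needed because otherwise the begin/end tags would be consumed and `code` could not recognize them.

I guess it could still be made more concise (i.e. without having to repeat `string "\\end{code}\n"` inside `code`).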
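One way to avoid the repetition is sketched below. It swaps `untilP` for Parsec's `manyTill` and `notFollowedBy`, so it is a different technique from the code above, not a refactoring of it. The names `beginTag` and `endTag` are made up for this sketch. Unlike `untilP`, it keeps the newlines in the result and recognizes the tags anywhere in the text, not only at the start of a line.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

data Element = Text String | Haskell String
  deriving (Show, Eq)

-- Illustrative tag parsers; `try` makes them fail without consuming input.
beginTag, endTag :: Parser String
beginTag = try (string "\\begin{code}\n")
endTag   = try (string "\\end{code}\n")

-- A code block: everything between the tags. manyTill consumes the end tag itself.
code :: Parser Element
code = Haskell <$> (beginTag *> manyTill anyChar endTag)

-- Prose: at least one character, stopping before the next begin tag or at end of input.
prose :: Parser Element
prose = Text <$> many1 (notFollowedBy beginTag *> anyChar)

literateFile :: Parser [Element]
literateFile = many (code <|> prose) <* eof

main :: IO ()
main = do
  contents <- readFile "hello.lhs"
  print (parse literateFile "hello.lhs" contents)
```

Because `prose` uses `many1`, it can never succeed on empty input, so the outer `many` is safe.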


sth*_*sth 6

I haven't tested it, but:

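  • `many anyChar` can match an empty string
  • therefore `prose` can match an empty string
  • therefore `codeOrProse` can match an empty string
  • therefore `literateFile` can loop forever, matching infinitely many empty strings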

Changing `prose` to match `many1` characters might solve this problem.

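(I'm not very familiar with Parsec, but how will `prose` know how many characters it should match? It might consume the whole input, never giving the `code` parser a second chance to look for the start of a new code segment. Or it might match only one character in each call, making the `many`/`many1` in it useless.)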
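The reasoning above can be checked with a small sketch. The names `proseEmpty` and `proseNonEmpty` are made up for illustration, and the sketch assumes Parsec 3, where `many` reports this situation as a runtime error instead of looping.

```haskell
import qualified Control.Exception as E
import Text.Parsec
import Text.Parsec.String (Parser)

-- prose as in the question: succeeds on empty input without consuming anything.
proseEmpty :: Parser String
proseEmpty = many anyChar

-- prose with many1: must consume at least one character to succeed.
proseNonEmpty :: Parser String
proseNonEmpty = many1 anyChar

main :: IO ()
main = do
  -- Wrapping an empty-accepting parser in `many` triggers the error from the question.
  r <- E.try (E.evaluate (either (const 0) length (parse (many proseEmpty) "" "abc")))
  case r of
    Left (E.ErrorCall msg) -> putStrLn ("error: " ++ msg)
    Right n                -> print n
  -- With many1 the outer `many` terminates, but a single chunk swallows the whole input.
  print (either (const []) id (parse (many proseNonEmpty) "" "abc"))
```

The second result is a single chunk containing the entire input, which matches the worry in the parenthesis: `many1 anyChar` alone stops the error but never hands control back to `code`.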