I'm trying to write a parser with Parsec that parses literate Haskell files such as the following:
    The classic 'Hello, world' program.

    \begin{code}
    main = putStrLn "Hello, world"
    \end{code}

    More text.
I've written the following, inspired by the examples in RWH:
    import Text.ParserCombinators.Parsec

    main
        = do contents <- readFile "hello.lhs"
             let results = parseLiterate contents
             print results

    data Element
        = Text String
        | Haskell String
        deriving (Show)

    parseLiterate :: String -> Either ParseError [Element]

    parseLiterate input
        = parse literateFile "(unknown)" input

    literateFile
        = many codeOrProse

    codeOrProse
        = code <|> prose

    code
        = do eol
             string "\\begin{code}"
             eol
             content <- many anyChar
             eol
             string "\\end{code}"
             eol
             return $ Haskell content

    prose
        = do content <- many anyChar
             return $ Text content

    eol
        =   try (string "\n\r")
        <|> try (string "\r\n")
        <|> string "\n"
        <|> string "\r"
        <?> "end of line"
I was hoping that this would result in something along the lines of:
    [Text "The classic 'Hello, world' program.", Haskell "main = putStrLn \"Hello, world\"", Text "More text."]
(allowing for whitespace etc.).

This compiles fine, but when I run it, I get the error:
    *** Exception: Text.ParserCombinators.Parsec.Prim.many: combinator 'many' is applied to a parser that accepts an empty string
Can anyone shed any light on this, and possibly help with a solution?
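As ... pointed out, `many anyChar` is the problem. But not just in `prose`, also in `code`. The problem with `code` is that `content <- many anyChar` will consume everything: the newlines and the `\end{code}` tag as well.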
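To make the greediness concrete, here is a minimal sketch. The names `greedy`, `greedyThenTag` and `sample` are made up for this illustration and are not part of the question's code. It shows that `many anyChar` runs to the end of the input, so a following `string "\\end{code}"` has nothing left to match:

```haskell
import Text.ParserCombinators.Parsec

-- many anyChar never stops on its own: it accepts every character,
-- including newlines and the characters of the end tag.
greedy :: Parser String
greedy = many anyChar

-- Same shape as the question's code parser: by the time we ask for
-- the end tag, many anyChar has already consumed it.
greedyThenTag :: Parser String
greedyThenTag = do
    content <- many anyChar
    _ <- string "\\end{code}"
    return content

-- Hypothetical input: one line of code followed by the end tag.
sample :: String
sample = "main = putStrLn \"hi\"\n\\end{code}\n"
```

Running `parse greedy "" sample` should return the whole input, end tag included, while `parse greedyThenTag "" sample` should fail at the end of the input.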
So you need some way to tell prose and code apart. An easy (but maybe too naive) way to do so is to look for backslashes:
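    -- (<*) comes from Control.Applicative on older GHC versions
    literateFile = many codeOrProse <* eof

    -- code: everything between the tags that is not a backslash
    code = do string "\\begin{code}"
              content <- many $ noneOf "\\"
              string "\\end{code}"
              return $ Haskell content

    -- prose: at least one character, stopping at the next backslash
    prose = do content <- many1 $ noneOf "\\"
               return $ Text content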
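Now you don't quite get the desired result, because the `Haskell` part will also contain newlines, but you can filter these out quite easily (given a function `filterNewlines` you could say `content <- filterNewlines <$> (many $ noneOf "\\")`).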
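The answer leaves `filterNewlines` undefined. One possible definition, purely as an assumption about what was meant, simply drops every `\n` and `\r`. The sketch below plugs it into the backslash-based `code` parser, using `fmap` instead of `<$>` so that no extra import is needed:

```haskell
import Text.ParserCombinators.Parsec

data Element
    = Text String
    | Haskell String
    deriving (Show, Eq)

-- Hypothetical helper (not defined in the original answer):
-- removes all line-break characters from the string.
filterNewlines :: String -> String
filterNewlines = filter (`notElem` "\r\n")

-- The naive code parser from above, with newlines filtered out.
code :: Parser Element
code = do
    _ <- string "\\begin{code}"
    content <- fmap filterNewlines (many (noneOf "\\"))
    _ <- string "\\end{code}"
    return (Haskell content)
```

Note that this definition also removes the line breaks inside a multi-line code block, which would glue its lines together. Stripping only the leading and trailing newline may be closer to what you want in practice.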
Edit

Okay, I think I found a solution (requires the newest Parsec version, because of `lookAhead`):
    import Text.ParserCombinators.Parsec
    import Control.Applicative hiding (many, (<|>))

    main
        = do contents <- readFile "hello.lhs"
             let results = parseLiterate contents
             print results

    data Element
        = Text String
        | Haskell String
        deriving (Show)

    parseLiterate :: String -> Either ParseError [Element]

    parseLiterate input
        = parse literateFile "" input

    literateFile
        = many codeOrProse

    codeOrProse = code <|> prose

    code = do string "\\begin{code}\n"
              c <- untilP (string "\\end{code}\n")
              string "\\end{code}\n"
              return $ Haskell c

    prose = do t <- untilP $ (string "\\begin{code}\n") <|> (eof >> return "")
               return $ Text t

    untilP p = do s <- many $ noneOf "\n"
                  newline
                  s' <- try (lookAhead p >> return "") <|> untilP p
                  return $ s ++ s'
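`untilP p` parses a line, then checks whether the beginning of the next line can be successfully parsed by `p`. If so, it returns the empty string, otherwise it carries on with the next line. The `lookAhead` is needed because otherwise the begin/end tags would be consumed and `code` couldn't recognize them.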
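The effect of `lookAhead` can be seen in isolation with a small sketch. The names `tag`, `peekThenParse` and `parseTwice` are invented for this example:

```haskell
import Text.ParserCombinators.Parsec

tag :: Parser String
tag = string "\\end{code}\n"

-- lookAhead runs the parser but leaves the input untouched on success,
-- so the tag is still there to be parsed for real afterwards.
peekThenParse :: Parser String
peekThenParse = lookAhead tag >> tag

-- Without lookAhead the first tag is consumed, and the second
-- attempt finds nothing left to match.
parseTwice :: Parser String
parseTwice = tag >> tag
```

On the input `"\\end{code}\n"`, `peekThenParse` should succeed and `parseTwice` should fail, which mirrors why `untilP` only peeks at the tag and leaves consuming it to `code`.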
I guess it could still be made more concise (i.e. without having to repeat `string "\\end{code}\n"` inside `code`).
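One way to avoid repeating the end tag, offered here as an alternative sketch rather than as the answer's own method, is Parsec's `manyTill` combinator, which both detects and consumes its terminator:

```haskell
import Text.ParserCombinators.Parsec

data Element
    = Text String
    | Haskell String
    deriving (Show, Eq)

-- manyTill collects characters until the end tag parses, and consumes
-- the tag itself, so it is mentioned only once. The try lets a partial
-- match such as "\\e" backtrack and be treated as ordinary content.
code :: Parser Element
code = do
    _ <- string "\\begin{code}\n"
    body <- manyTill anyChar (try (string "\\end{code}\n"))
    return (Haskell body)
```

This behaves a little differently from `untilP`: the body keeps its newlines, and the end tag is recognized wherever it appears, not only at the start of a line.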
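I haven't tested it, but:

- `many anyChar` can match an empty string
- so `prose` can match an empty string
- so `codeOrProse` can match an empty string
- so `literateFile` can loop forever, matching infinitely many empty strings

Changing `prose` to match `many1` characters might fix this problem.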
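The chain of reasoning above can be checked with a small sketch. `acceptsEmpty` and `chunks` are helper names made up for this example:

```haskell
import Text.ParserCombinators.Parsec

-- True if the parser succeeds on the empty input.
acceptsEmpty :: Parser a -> Bool
acceptsEmpty p = either (const False) (const True) (parse p "" "")

-- With many1 on the inside, the outer many is safe: the inner parser
-- fails, instead of succeeding with nothing, once the input runs out.
chunks :: Parser [String]
chunks = many (many1 anyChar)
```

`acceptsEmpty (many anyChar)` should be `True` and `acceptsEmpty (many1 anyChar)` should be `False`. Parsing `"abc"` with `chunks` terminates, but yields the single chunk `"abc"`, so `many1` alone removes the runtime error without making `prose` stop at a code block.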
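(I'm not very familiar with Parsec, but how will `prose` know how many characters it should match? It might consume the whole input, never giving the `code` parser a second chance to look for the start of a new code segment. Alternatively it might only match one character in each call, making the `many`/`many1` in it useless.)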
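One possible answer to that question, sketched here under the assumption that a code block always starts with the literal `\begin{code}` followed by a newline, is to let `prose` stop as soon as a begin tag or the end of input is ahead. `proseEnd` is a name invented for this sketch:

```haskell
import Text.ParserCombinators.Parsec

data Element
    = Text String
    | Haskell String
    deriving (Show, Eq)

-- Succeeds, without consuming anything, when a code block starts here
-- or when the input is exhausted.
proseEnd :: Parser ()
proseEnd = (lookAhead (try (string "\\begin{code}\n")) >> return ()) <|> eof

-- At least one character, so prose never matches the empty string,
-- then everything up to (but not including) the next begin tag.
prose :: Parser Element
prose = do
    c  <- anyChar
    cs <- manyTill anyChar proseEnd
    return (Text (c : cs))
```

Because the first `anyChar` is unconditional, this only works if `code` is tried before `prose`, as in `code <|> prose` with a `try` around the begin tag. Otherwise `prose` would swallow the backslash of a tag that sits at the current position.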