Joe*_*and 17 monads haskell web-scraping
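I'm trying to use Haskell to scrape a web page and compile the results into an object.
If, for whatever reason, I can't get all of the items from the page, I want to stop processing the page and return early.
For example: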
scrapePage :: String -> IO ()
scrapePage url = do
    doc <- fromUrl url
    title <- liftM headMay $ runX $ doc >>> css "head.title" >>> getText
    when (isNothing title) (return ())
    date <- liftM headMay $ runX $ doc >>> css "span.dateTime" ! "data-utc"
    when (isNothing date) (return ())
    -- etc
    -- make page object and send it to db
    return ()
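The problem is that when doesn't stop the do block from executing, nor does it keep the remaining statements from running.
What is the right way to do this?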
Phi*_* JF 18
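In Haskell, return does not do what return does in other languages. It does not exit anything; it only injects a value into the monad (in this case IO). You have a few options.
The simplest is to use if: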
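To see that return really does not leave the block, here is a small self-contained sketch. The function name returnDoesNotExit and the IORef log are made up for illustration; they are not part of the question's code.

```haskell
import Control.Monad (when)
import Data.IORef (modifyIORef, newIORef, readIORef)

-- Hypothetical demo: records which steps ran, to show that
-- `return ()` inside `when` does not leave the enclosing do block.
returnDoesNotExit :: IO [String]
returnDoesNotExit = do
    steps <- newIORef []
    let record step = modifyIORef steps (++ [step])
    record "before"
    when True (return ())   -- an IO action that does nothing
    record "after"          -- still runs
    readIORef steps
```

Running returnDoesNotExit in GHCi should list both steps, which is exactly the behaviour the question ran into.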
scrapePage :: String -> IO ()
scrapePage url = do
    doc <- fromUrl url
    title <- liftM headMay $ runX $ doc >>> css "head.title" >>> getText
    if isNothing title then return () else do
        date <- liftM headMay $ runX $ doc >>> css "span.dateTime" ! "data-utc"
        if isNothing date then return () else do
            -- etc
            -- make page object and send it to db
            return ()
Another option is to use unless:
scrapePage url = do
    doc <- fromUrl url
    title <- liftM headMay $ runX $ doc >>> css "head.title" >>> getText
    -- the `$` is needed before `do` unless BlockArguments is enabled
    unless (isNothing title) $ do
        date <- liftM headMay $ runX $ doc >>> css "span.dateTime" ! "data-utc"
        unless (isNothing date) $ do
            -- etc
            -- make page object and send it to db
            return ()
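The general problem here is that the IO monad has no control effects (other than exceptions). You could, however, use a monad transformer such as MaybeT: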
scrapePage url = liftM (maybe () id) . runMaybeT $ do
    doc <- liftIO $ fromUrl url
    title <- liftIO $ liftM headMay $ runX $ doc >>> css "head.title" >>> getText
    guard (isJust title)
    date <- liftIO $ liftM headMay $ runX $ doc >>> css "span.dateTime" ! "data-utc"
    guard (isJust date)
    -- etc
    -- make page object and send it to db
    return ()
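As a runnable illustration of the MaybeT approach that does not need a scraping library, the sketch below replaces the real page with an association list. Page, queryAll, firstMatch and scrapeFields are hypothetical names invented for this example; queryAll only mimics the "list of matches" shape that runX returns. Wrapping each query in MaybeT binds the unwrapped value directly, so no separate guard is needed.

```haskell
import Control.Monad.Trans.Maybe (MaybeT (..))
import Data.Maybe (listToMaybe)

-- Hypothetical stand-in for a fetched page.
type Page = [(String, String)]

-- Hypothetical query: returns every match, like `runX` would.
queryAll :: String -> Page -> IO [String]
queryAll key page = return [v | (k, v) <- page, k == key]

-- First match, or abort the whole MaybeT block if there is none.
firstMatch :: String -> Page -> MaybeT IO String
firstMatch key page = MaybeT (listToMaybe <$> queryAll key page)

-- Yields Nothing as soon as one field is missing; later steps are skipped.
scrapeFields :: Page -> IO (Maybe (String, String))
scrapeFields page = runMaybeT $ do
    title <- firstMatch "title" page
    date  <- firstMatch "date" page
    return (title, date)
```

With a page that has both fields the result should be a Just pair; if either field is absent the block stops at that line and the result is Nothing.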
If you really want full-blown control effects, you need to use ContT:
scrapePage :: String -> IO ()
-- runContT takes the computation first, hence the flip;
-- `exit` is the continuation that leaves the whole block
scrapePage url = flip runContT return $ callCC $ \exit -> do
    doc <- lift $ fromUrl url
    title <- lift $ liftM headMay $ runX $ doc >>> css "head.title" >>> getText
    when (isNothing title) $ exit ()
    date <- lift $ liftM headMay $ runX $ doc >>> css "span.dateTime" ! "data-utc"
    when (isNothing date) $ exit ()
    -- etc
    -- make page object and send it to db
    return ()
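To check that the ContT version actually skips the remaining steps, the following sketch logs each step to an IORef instead of scraping anything. The names record and stepsRun are hypothetical, and evalContT (from the transformers package) is used here as a shorthand for running the computation with return as the final continuation.

```haskell
import Control.Monad (when)
import Control.Monad.Trans.Class (lift)
import Control.Monad.Trans.Cont (ContT, callCC, evalContT)
import Data.IORef (IORef, modifyIORef, newIORef, readIORef)

-- Hypothetical helper: append a step name to the log.
record :: IORef [String] -> String -> ContT r IO ()
record ref step = lift (modifyIORef ref (++ [step]))

-- Runs two steps, but leaves after the first one when `bail` is True.
-- `exit` is the continuation captured by callCC around the whole block,
-- so calling it abandons everything that follows inside the block.
stepsRun :: Bool -> IO [String]
stepsRun bail = do
    ref <- newIORef []
    evalContT $ callCC $ \exit -> do
        record ref "title"
        when bail (exit ())
        record ref "date"
    readIORef ref
```

The important detail is that callCC wraps the entire block. Calling callCC at the point of the check and immediately invoking its continuation would only return to that same point, not leave the block.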
WARNING: none of the code above has been tested, or even type checked!
dav*_*420 13
Use a monad transformer!
import Control.Monad.Trans.Class -- from transformers package
import Control.Error.Util -- from errors package

scrapePage :: String -> IO ()
scrapePage url = maybeT (return ()) return $ do
    doc <- lift $ fromUrl url
    title <- liftM headMay $ lift . runX $ doc >>> css "head.title" >>> getText
    guard . not $ isNothing title
    date <- liftM headMay $ lift . runX $ doc >>> css "span.dateTime" ! "data-utc"
    guard . not $ isNothing date
    -- etc
    -- make page object and send it to db
    return ()
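For more flexibility in the value you return when exiting early, use throwError/eitherT/EitherT instead of mzero/maybeT/MaybeT. (You can't use guard in that case, though.)
(You could probably also use headZ instead of headMay and drop the explicit guard.)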
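A minimal sketch of the Either-style variant, under some assumptions: it uses ExceptT and throwE from the transformers package in place of the EitherT/throwError names mentioned above, and Page, queryAll, firstOr and scrapeFields are hypothetical stand-ins rather than real scraping calls. The helper firstOr plays a role similar to headZ, but attaches a message saying which field was missing.

```haskell
import Control.Monad.Trans.Class (lift)
import Control.Monad.Trans.Except (ExceptT, runExceptT, throwE)

-- Hypothetical stand-in for a fetched page.
type Page = [(String, String)]

-- Hypothetical query: returns every match, like `runX` would.
queryAll :: String -> Page -> IO [String]
queryAll key page = return [v | (k, v) <- page, k == key]

-- First match, or stop the block with a message naming the missing field.
firstOr :: String -> Page -> ExceptT String IO String
firstOr key page = do
    matches <- lift (queryAll key page)
    case matches of
        (v : _) -> return v
        []      -> throwE ("missing field: " ++ key)

-- Left with a reason on the first missing field, Right with both otherwise.
scrapeFields :: Page -> IO (Either String (String, String))
scrapeFields page = runExceptT $ do
    title <- firstOr "title" page
    date  <- firstOr "date" page
    return (title, date)
```

Compared with the MaybeT version, the caller can tell why the scrape stopped, at the cost of choosing an error type up front.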