Gre*_*con 16 xml tag-soup haskell large-data large-scale
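I've been exploring the Stack Overflow data dumps, so far taking advantage of the friendly XML and "parsing" it with regular expressions. My attempts to use various Haskell XML libraries to find the first post, in document order, by a particular user all ran into nasty thrashing.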
-- Attempt with TagSoup
import Control.Monad
import Text.HTML.TagSoup

userid = "83805"

main = do
  posts <- liftM parseTags (readFile "posts.xml")
  print $ head $ map (fromAttrib "Id") $
                 filter (~== ("<row OwnerUserId=" ++ userid ++ ">"))
                 posts
-- Attempt with HXT
import Text.XML.HXT.Arrow
import Text.XML.HXT.XPath

userid = "83805"

main = do
  runX $ readDoc "posts.xml" >>> posts >>> arr head
  where
    readDoc = readDocument [ (a_tagsoup, v_1)
                           , (a_parse_xml, v_1)
                           , (a_remove_whitespace, v_1)
                           , (a_issue_warnings, v_0)
                           , (a_trace, v_1)
                           ]

posts :: ArrowXml a => a XmlTree String
posts = getXPathTrees byUserId >>>
        getAttrValue "Id"
  where byUserId = "/posts/row/@OwnerUserId='" ++ userid ++ "'"
-- Attempt with the xml package (Text.XML.Light)
import Control.Monad
import Control.Monad.Error
import Control.Monad.Trans.Maybe
import Data.Either
import Data.Maybe
import Text.XML.Light

userid = "83805"

main = do
  [posts,votes] <- forM ["posts", "votes"] $
    liftM parseXML . readFile . (++ ".xml")
  let ps = elemNamed "posts" posts
  putStrLn $ maybe "<not present>" show
           $ filterElement (byUser userid) ps

elemNamed :: String -> [Content] -> Element
elemNamed name = head . filter ((==name).qName.elName) . onlyElems

byUser :: String -> Element -> Bool
byUser id e = maybe False (==id) (findAttr creator e)
  where creator = QName "OwnerUserId" Nothing Nothing
Where did I go wrong? What is the proper way to process hefty XML documents with Haskell?
Don*_*art 17
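I notice you're doing String IO in all of these cases. If you hope to process large volumes of text efficiently, you must use either Data.Text or Data.ByteString(.Lazy), because String == [Char], which is an inappropriate representation for very large flat files.
That means you need a Haskell XML library that supports bytestrings. The XML libraries on Hackage are listed here: http://hackage.haskell.org/packages/archive/pkg-list.html#cat:xml
I'm not sure which of them support bytestrings, but that's the criterion to look for.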
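To make the String versus ByteString point concrete, here is a minimal sketch that does the crude, regex-style scan from the question over a lazy ByteString, using only base and bytestring. It is not a real XML parser: it assumes each `<row .../>` sits on its own line and that the attribute is written exactly as `OwnerUserId="..."` with double quotes. The names `ownedByLine` and `firstRowBy` are made up for this illustration.

```haskell
import           Data.List (find)
import qualified Data.ByteString.Char8 as S
import qualified Data.ByteString.Lazy.Char8 as L

-- | True when the line contains OwnerUserId="<uid>" verbatim.
-- Assumption: double-quoted attribute, no whitespace around '='.
ownedByLine :: String -> L.ByteString -> Bool
ownedByLine uid line = needle `S.isInfixOf` L.toStrict line
  where needle = S.pack ("OwnerUserId=\"" ++ uid ++ "\"")

-- | First line, in document order, that belongs to the given user.
-- L.lines is lazy, so the scan stops at the first match and the
-- chunks before it can be garbage collected.
firstRowBy :: String -> L.ByteString -> Maybe L.ByteString
firstRowBy uid = find (ownedByLine uid) . L.lines
```

With a lazily read file this would be used as `L.readFile "posts.xml" >>= print . firstRowBy "83805"`. Only one line at a time is copied into a strict ByteString, so memory use should stay roughly proportional to the longest line rather than to the whole file.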
Here's an example using hexpat:
{-# LANGUAGE PatternGuards #-}

module Main where

import Text.XML.Expat.SAX

import qualified Data.ByteString.Lazy as B

userid = "83805"

main :: IO ()
main = B.readFile "posts.xml" >>= print . earliest
  where earliest :: B.ByteString -> SAXEvent String String
        earliest = head . filter (ownedBy userid) . parse opts
        opts = ParserOptions Nothing Nothing

ownedBy :: String -> SAXEvent String String -> Bool
ownedBy uid (StartElement "row" as)
  | Just ouid <- lookup "OwnerUserId" as = ouid == uid
  | otherwise = False
ownedBy _ _ = False
The definition of ownedBy is a bit clunky. Perhaps a view pattern instead:
{-# LANGUAGE ViewPatterns #-}

module Main where

import Text.XML.Expat.SAX

import qualified Data.ByteString.Lazy as B

userid = "83805"

main :: IO ()
main = B.readFile "posts.xml" >>= print . earliest
  where earliest :: B.ByteString -> SAXEvent String String
        earliest = head . filter (ownedBy userid) . parse opts
        opts = ParserOptions Nothing Nothing

ownedBy :: String -> SAXEvent String String -> Bool
ownedBy uid (ownerUserId -> Just ouid) = uid == ouid
ownedBy _ _ = False

ownerUserId :: SAXEvent String String -> Maybe String
ownerUserId (StartElement "row" as) = lookup "OwnerUserId" as
ownerUserId _ = Nothing
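
Both versions above use head, which crashes if the user has no posts, and they print the whole SAX event rather than the post Id the question asks for. A possible variation, as a sketch only: it assumes a hexpat release that exports defaultParseOptions from Text.XML.Expat.SAX (the code above builds its options by hand instead), and the names rowIdFor and firstPostId are invented for this example.

```haskell
import           Data.Maybe (listToMaybe, mapMaybe)
import qualified Data.ByteString.Lazy as B
import           Text.XML.Expat.SAX (SAXEvent (..), defaultParseOptions, parse)

-- | The Id attribute of a <row> start tag owned by the given user,
-- or Nothing for every other event.
rowIdFor :: String -> SAXEvent String String -> Maybe String
rowIdFor uid (StartElement "row" attrs)
  | lookup "OwnerUserId" attrs == Just uid = lookup "Id" attrs
rowIdFor _ _ = Nothing

-- | Id of the first matching post in document order.
-- listToMaybe replaces head, so "no such post" becomes Nothing
-- instead of a runtime error; the event list is still consumed lazily.
firstPostId :: String -> B.ByteString -> Maybe String
firstPostId uid =
  listToMaybe . mapMaybe (rowIdFor uid) . parse defaultParseOptions
```

Usage would look like `B.readFile "posts.xml" >>= print . firstPostId "83805"`. Parse failures surface as a FailDocument event, which this sketch silently ignores.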
小智 8
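You can try my fast-tagsoup library. It is a simple drop-in replacement for tagsoup and parses at 20-200 MB/sec.
The problem with the tagsoup package is that it works with String internally even if you use its Text or ByteString interface. fast-tagsoup works on strict ByteStrings with high-performance low-level parsing, while still returning a lazy list of tags as output.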