fir*_*iku 1 sorting haskell data-structures word-frequency
我想从Google N-Grams数据集中提取最常用的单词,其未压缩形式大约为20 GB.我不希望使用整个数据集,只需要最频繁的5000个数据集.但如果我写
take 5000 $ sortBy (flip $ comparing snd) dataset
-- dataset :: IO [(word::String, frequency::Int)]
Run Code Online (Sandbox Code Playgroud)
这将是一个无尽的等待.但我该怎么做呢?
我知道Data.Array.MArray包可用于就地数组计算,但我在其文档页面上看不到任何修改项目的功能.还有Data.HashTable.IO,但它是一个无序的数据结构.
我想使用简单Data.IntMap.Strict(具有方便的lookupLE功能),但我不认为它会非常有效,因为它会在每次更改时生成一个新的地图.难道ST单子提高吗?
UPD:我还在CoreReview.SX上发布了程序的最终版本.
怎么样
splitAt划分数据集到第5000项和休息.然后,该过程变为有效线性,但如果对具有次线性min-delete和插入的已排序的5000个元素使用数据结构,则系数会得到改善.
import Data.List (foldl')
import Data.Maybe (fromJust)
import Data.Heap hiding (splitAt)
mostFreq :: Int -> [(String, Int)] -> [(String, Int)]
mostFreq n dataset = final
where
-- change our pairs from (String,Int) to (Int,String)
pairs = map swap dataset
-- get the first `n` pairs in one list, and the rest of the pairs in another
(first, rest) = splitAt n pairs
-- put all the first `n` pairs into a MinHeap
start = fromList first :: MinHeap (Int, String)
-- then run through the rest of the pairs
stop = foldl' step start rest
-- modifying the heap to replace its least frequent pair
-- with the new pair if the new pair is more frequent
step heap pair = if viewHead heap < Just pair
then insert pair (fromJust $ viewTail heap)
else heap
-- turn our heap of (Int, String) pairs into a list of (String,Int) pairs
final = map swap (toList stop)
swap ~(a,b) = (b,a)
Run Code Online (Sandbox Code Playgroud)