最有效的数据结构,用于查找最常见的项目

Question

最有效的数据结构,用于查找最常见的项目

fir*_*iku 1 sorting haskell data-structures word-frequency

我想从Google N-Grams数据集中提取最常用的单词,其未压缩形式大约为20 GB.我不希望使用整个数据集,只需要最频繁的5000个数据集.但如果我写

take 5000 $ sortBy (flip $ comparing snd) dataset
-- dataset :: IO [(word::String, frequency::Int)]

Run Code Online (Sandbox Code Playgroud)

这将是一个无尽的等待.但我该怎么做呢？

我知道Data.Array.MArray包可用于就地数组计算,但我在其文档页面上看不到任何修改项目的功能.还有Data.HashTable.IO,但它是一个无序的数据结构.

我想使用简单Data.IntMap.Strict(具有方便的lookupLE功能),但我不认为它会非常有效,因为它会在每次更改时生成一个新的地图.难道ST单子提高吗？

UPD:我还在CoreReview.SX上发布了程序的最终版本.

Answer 1

ram*_*ion 5

怎么样

采用splitAt划分数据集到第5000项和休息.
按频率排序前5000个项目(升序)
经过其余的
- 如果项目的频率高于已排序项目中的最低频率
- 从已排序的项目中删除最低频率项目
- 将新项目插入已排序项目的适当位置

然后,该过程变为有效线性,但如果对具有次线性min-delete和插入的已排序的5000个元素使用数据结构,则系数会得到改善.

例如,Data.Heap从heap包中使用:

import Data.List (foldl')
import Data.Maybe (fromJust)
import Data.Heap hiding (splitAt)

mostFreq :: Int -> [(String, Int)] -> [(String, Int)]
mostFreq n dataset = final
  where
    -- change our pairs from (String,Int) to (Int,String)
    pairs = map swap dataset
    -- get the first `n` pairs in one list, and the rest of the pairs in another
    (first, rest) = splitAt n pairs
    -- put all the first `n` pairs into a MinHeap
    start = fromList first :: MinHeap (Int, String)
    -- then run through the rest of the pairs
    stop = foldl' step start rest
    -- modifying the heap to replace its least frequent pair
    -- with the new pair if the new pair is more frequent
    step heap pair = if viewHead heap < Just pair
                       then insert pair (fromJust $ viewTail heap)
                       else heap
    -- turn our heap of (Int, String) pairs into a list of (String,Int) pairs
    final = map swap (toList stop)
    swap ~(a,b) = (b,a)

Run Code Online (Sandbox Code Playgroud)

@firegurafiku:堆或树是高效"可变的",在某种意义上,你创建一个新的(更新的)结构,它在子线性时间内(比链表更好)重用以前的大部分结构. (2认同)

归档时间：	10 年，4 月前
查看次数：	249 次
最近记录：	10 年，4 月前