How can I quickly get the top N most frequent items from an unsorted array in Java?

Question

mag*_*ang 5 java sorting algorithm performance

I have tried two approaches.

Use a HashMap to count the occurrences of each item, then iterate over the map:

HashMap<Integer, Integer> doc_counts = new HashMap<Integer, Integer>();
for (int i = 0; i < p; ++i) {
    int doc = alld[i];
    Integer count = doc_counts.get(doc);
    if (null == count)
        count = 0;
    doc_counts.put(doc, count + 1);
}
// up to this point the counting loop has already cost about 200 ms
for (Entry<Integer, Integer> item : doc_counts.entrySet()) {
    heapCheck(h, hsize, item.getKey(), item.getValue());    // use the heap to keep the top hsize items
}


Sort the array first, then use the heap to get the top N:

Arrays.sort(alld, 0, p); // the sort costs about 160ms
int curr = alld[0];
int count = 0;
for(int i = 0; i < p; i++) {
    int doc = alld[i];
    if(doc == curr) {
        ++count;
    } else {
        ++nHits;
        //curr += base;
        heapCheck(h, hsize, curr, count);
        curr = doc;
        count = 1;
    }
}
//
// Handle the last document that was collected.
heapCheck(h, hsize, curr, count);


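Testing on an array of 1,600,000 items shows that the second approach takes about 170 ms, most of it spent in the sort (about 160 ms), while the first approach takes 200 ms just to add all the items to the HashMap. How can I improve performance: find a faster map or sort function, or make it parallel so it uses multiple threads?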

Answer 1

小智 0

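The sort-based approach is O(n log n), while adding everything to a HashMap is O(n), so the HashMap version is most likely losing on constant factors caused by resizing and rehashing. Try specifying a large initial capacity to avoid excessive resize operations.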
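
A minimal sketch of the presizing suggestion, assuming the caller can estimate the number of distinct items. The class name `PresizedCount`, the method `countOccurrences`, and the `expectedDistinct` parameter are hypothetical names introduced for illustration; `alld` and `p` follow the question. Whether this removes enough of the 200 ms would have to be measured.

```java
import java.util.HashMap;
import java.util.Map;

final class PresizedCount {
    // Counts occurrences of the first p elements of alld.
    // expectedDistinct is a caller-supplied estimate (an assumption of this sketch).
    static Map<Integer, Integer> countOccurrences(int[] alld, int p, int expectedDistinct) {
        // HashMap resizes when its size exceeds capacity * loadFactor (0.75 by default),
        // so request enough capacity for expectedDistinct entries to fit without a resize.
        int initialCapacity = (int) (expectedDistinct / 0.75f) + 1;
        Map<Integer, Integer> docCounts = new HashMap<>(initialCapacity);
        for (int i = 0; i < p; ++i) {
            // merge() inserts 1 for a new key, otherwise adds 1 to the existing count
            docCounts.merge(alld[i], 1, Integer::sum);
        }
        return docCounts;
    }

    public static void main(String[] args) {
        int[] alld = {7, 3, 7, 5, 3, 7};
        System.out.println(countOccurrences(alld, alld.length, 3));
    }
}
```

If the estimate is too low the map still works correctly; it simply falls back to resizing as it grows.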

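What if, instead of constantly replacing the Integer in the map, you created a mutable "class Count {public int value;}" and, once you have found the right instance, incremented the count it contains? That would halve the number of map lookups. (3 upvotes)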
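
The mutable-counter idea from the comment above could look roughly like the sketch below. The question's `heapCheck` helper is not shown, so `java.util.PriorityQueue` is swapped in here as a stand-in for the top-N selection; the class name `MutableCountTopN` and the method `topN` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class MutableCountTopN {
    // Mutable counter, so an existing entry is updated in place.
    static final class Count {
        int value;
    }

    // Returns up to n {item, count} pairs, most frequent first.
    static List<int[]> topN(int[] alld, int p, int n) {
        Map<Integer, Count> docCounts = new HashMap<>();
        for (int i = 0; i < p; ++i) {
            Count c = docCounts.get(alld[i]);
            if (c == null) {
                c = new Count();
                docCounts.put(alld[i], c);  // second map access only for a new key
            }
            c.value++;                      // existing keys need no put()
        }

        // Min-heap ordered by count: the root is the weakest of the current top n.
        PriorityQueue<int[]> heap =
                new PriorityQueue<>(Comparator.comparingInt((int[] e) -> e[1]));
        for (Map.Entry<Integer, Count> e : docCounts.entrySet()) {
            heap.offer(new int[] {e.getKey(), e.getValue().value});
            if (heap.size() > n) {
                heap.poll();                // evict the smallest count
            }
        }

        // Drain in ascending order, then reverse so the most frequent comes first.
        List<int[]> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(heap.poll());
        }
        Collections.reverse(result);
        return result;
    }
}
```

Calling `MutableCountTopN.topN(alld, p, hsize)` would take the place of the counting loop and the `heapCheck` loop in the first approach. Items with equal counts come back in no particular order.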
