Unexpected running times for HashSet code

Question

scd*_*vad 27 java performance for-loop hashset

So originally, I had this code:

import java.util.*;

public class sandbox {
    public static void main(String[] args) {
        HashSet<Integer> hashSet = new HashSet<>();
        for (int i = 0; i < 100_000; i++) {
            hashSet.add(i);
        }

        long start = System.currentTimeMillis();

        for (int i = 0; i < 100_000; i++) {
            for (Integer val : hashSet) {
                if (val != -1) break;
            }

            hashSet.remove(i);
        }

        System.out.println("time: " + (System.currentTimeMillis() - start));
    }
}

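On my computer, the nested for loop takes about 4 seconds to run, and I don't understand why it takes so long. The outer loop runs 100,000 times, the inner for loop should run only once per outer iteration (no value in hashSet is ever -1, so it breaks at the first element), and removing an item from a HashSet is O(1), so there should be around 200,000 operations. If a computer typically performs about 100,000,000 operations per second, why does my code take 4 seconds to run?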

Additionally, if the line hashSet.remove(i); is commented out, the code takes only 16 ms. If the inner for loop is commented out (but not hashSet.remove(i);), the code takes only 8 ms.

Answer 1

apa*_*gin 32

You've created an edge case for HashSet, where the algorithm degrades to quadratic complexity.

Here is the simplified loop that takes so long:

for (int i = 0; i < 100_000; i++) {
    hashSet.iterator().next();
    hashSet.remove(i);
}

async-profiler shows that almost all of the time is spent inside the java.util.HashMap$HashIterator() constructor:

    HashIterator() {
        expectedModCount = modCount;
        Node<K,V>[] t = table;
        current = next = null;
        index = 0;
        if (t != null && size > 0) { // advance to first entry
--->        do {} while (index < t.length && (next = t[index++]) == null);
        }
    }

The highlighted line is a linear loop that searches for the first non-empty bucket in the hash table.

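Since Integer has a trivial hashCode (i.e. hashCode is equal to the number itself), consecutive integers mostly end up occupying consecutive buckets in the hash table: number 0 goes to the first bucket, number 1 goes to the second bucket, and so on.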
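
To make the bucket placement concrete, here is a minimal sketch of how a key could be mapped to a bucket. It assumes the hash-spreading step used by OpenJDK 8 and later (hashCode XOR-ed with its upper 16 bits, then masked by the table length); other JDK implementations may differ. The class name BucketIndexSketch, its helper methods, and the table length of 262,144 (the capacity a default HashMap is expected to reach after 100,000 insertions) are illustrative assumptions, not part of the original answer.

```java
public class BucketIndexSketch {
    // Assumption: mirrors the spreading step of OpenJDK 8+ HashMap.
    static int spread(int hashCode) {
        return hashCode ^ (hashCode >>> 16);
    }

    // Bucket index for a table whose length is a power of two.
    static int bucketIndex(Object key, int tableLength) {
        return spread(key.hashCode()) & (tableLength - 1);
    }

    public static void main(String[] args) {
        int tableLength = 262_144; // assumed capacity after adding 100,000 keys
        for (int key : new int[] {0, 1, 2, 3, 99_999}) {
            System.out.println(key + " -> bucket " + bucketIndex(key, tableLength));
        }
    }
}
```

Under these assumptions, keys below 65,536 map to the bucket with the same number, while larger keys can be shifted slightly by the spreading step (99,999 lands in bucket 99,998), which is why the placement is "mostly" consecutive rather than exactly so.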

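Now you remove the consecutive numbers from 0 to 99999. In the simplest case (when a bucket contains a single key), removing a key is implemented by nulling out the corresponding element in the bucket array. Note that the table is not compacted or rehashed after removal.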

So, the more keys you remove from the beginning of the bucket array, the longer HashIterator takes to find the first non-empty bucket.
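
The quadratic cost can be illustrated with a simplified model that only counts bucket probes. This is a rough sketch, not the real HashMap: it assumes key i sits alone in slot i and that removal simply clears the slot. The class ProbeCountSketch and the method countProbes are hypothetical names introduced for the illustration.

```java
import java.util.Arrays;

public class ProbeCountSketch {
    // Counts the slots examined in total when, before each removal of key i,
    // we scan from slot 0 for the first occupied slot.
    // Simplified model: key i occupies slot i; removal just clears that slot.
    static long countProbes(int n) {
        boolean[] occupied = new boolean[n];
        Arrays.fill(occupied, true);
        long probes = 0;
        for (int i = 0; i < n; i++) {
            int index = 0;
            // Linear scan, analogous to the do/while loop in HashIterator().
            while (index < n) {
                probes++;
                if (occupied[index++]) {
                    break;
                }
            }
            occupied[i] = false; // "remove" key i without compacting the table
        }
        return probes;
    }

    public static void main(String[] args) {
        System.out.println(countProbes(10_000)); // prints 50005000
    }
}
```

In this model the total is n(n+1)/2 probes, so n = 100,000 gives roughly 5 billion probes, which is in the right range to explain a running time of a few seconds.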

Try removing the keys from the other end:

hashSet.remove(100_000 - i);

The algorithm will become dramatically faster!
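
One way to check this yourself is to time both removal orders side by side. The following is a rough sketch rather than a rigorous benchmark (no JIT warm-up, a single run); the class RemovalOrderSketch and the method timeRemovalMillis are hypothetical names. It removes n - 1 - i in the descending case so that every key, including 0, is actually removed.

```java
import java.util.HashSet;
import java.util.Set;

public class RemovalOrderSketch {
    // Fills a set with 0..n-1, then repeatedly creates an iterator and removes
    // one key, either in ascending or in descending order. Returns elapsed ms.
    static long timeRemovalMillis(int n, boolean ascending) {
        Set<Integer> set = new HashSet<>();
        for (int i = 0; i < n; i++) {
            set.add(i);
        }
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            set.iterator().next(); // forces the scan for the first non-empty bucket
            set.remove(ascending ? i : n - 1 - i);
        }
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        if (!set.isEmpty()) {
            throw new IllegalStateException("expected all keys to be removed");
        }
        return elapsedMillis;
    }

    public static void main(String[] args) {
        int n = 100_000;
        System.out.println("ascending:  " + timeRemovalMillis(n, true) + " ms");
        System.out.println("descending: " + timeRemovalMillis(n, false) + " ms");
    }
}
```

Exact timings depend on the machine and JVM, but the ascending order should be much slower, because each new iterator has to skip over the growing run of emptied buckets at the start of the table.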
