Erickson's answer is probably the one that whoever posed this question was expecting.
You could use each of the N machines as a bucket in a hash table:
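A minimal sketch of that idea in Python, with the N machines simulated as in-process sets. The names `machine_for` and `find_duplicates` are made up for illustration, and SHA-256 is just one choice of a hash that is stable across processes; in a real setup each string would be sent over the network to the machine it hashes to.

```python
import hashlib


def machine_for(s: str, n_machines: int) -> int:
    """Map a string to a machine index using a hash that is stable across runs."""
    digest = hashlib.sha256(s.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_machines


def find_duplicates(strings, n_machines):
    """Route each string to its 'machine' and report strings seen more than once."""
    # Each machine is simulated by a local set of the strings it has seen so far.
    seen = [set() for _ in range(n_machines)]
    duplicates = set()
    for s in strings:
        bucket = seen[machine_for(s, n_machines)]
        if s in bucket:
            duplicates.add(s)  # equal strings always land on the same machine
        else:
            bucket.add(s)
    return duplicates
```

Because equal strings always hash to the same machine, each machine can detect its own duplicates without talking to the others.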
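Honestly, for 10 billion strings you could plausibly do this on 1 PC. The hash table might occupy something like 80-120 GB with a 32-bit hash, depending on the exact hash table implementation. If you're looking for an efficient solution, you'd have to be more specific about what you mean by "machine", because it depends on how much storage each one has and on the relative cost of network communication.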
Split the file into N pieces. On each machine, load as much of the piece into memory as you can, and sort the strings. Write these chunks to mass storage on that machine. On each machine, merge the chunks into a single stream, and then merge the stream from each machine into a stream that contains all of the strings in sorted order. Compare each string with the previous. If they are the same, it is a duplicate.
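The steps above can be sketched on a single machine as follows. This is only an illustration under some assumptions: strings contain no newlines, `chunk_size` stands in for "as much as fits in memory", and the function names are invented here. The per-machine merge and the cross-machine merge are collapsed into one `heapq.merge` call.

```python
import heapq
import os
import tempfile


def _write_run(sorted_chunk, tmpdir):
    """Write one sorted chunk to its own temporary file, one string per line."""
    fd, path = tempfile.mkstemp(dir=tmpdir, text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        for s in sorted_chunk:
            f.write(s + "\n")
    return path


def _read_run(path):
    """Stream the strings of a sorted run back from disk."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")


def duplicates_by_sorting(strings, chunk_size=2):
    """Sort chunks in memory, spill them to disk, merge, and compare neighbours."""
    with tempfile.TemporaryDirectory() as tmpdir:
        paths, chunk = [], []
        for s in strings:
            chunk.append(s)
            if len(chunk) >= chunk_size:
                paths.append(_write_run(sorted(chunk), tmpdir))
                chunk = []
        if chunk:
            paths.append(_write_run(sorted(chunk), tmpdir))

        # Merge all sorted runs into one sorted stream without loading them fully.
        merged = heapq.merge(*(_read_run(p) for p in paths))

        dups, previous = [], None
        for s in merged:
            # A string equal to its predecessor is a duplicate; report it once.
            if s == previous and (not dups or dups[-1] != s):
                dups.append(s)
            previous = s
        return dups
```

For example, `duplicates_by_sorting(["pear", "apple", "fig", "apple", "pear", "apple"])` returns `["apple", "pear"]`.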