And*_*ehi 30 c# memory algorithm collections
我的问题不常见.让我们想象几十亿字符串.字符串通常少于15个字符.在此列表中,我需要找出唯一元素的数量.
首先,我应该使用什么对象?你不应该忘记,如果我添加一个新的元素,我必须检查它是否已经存在于列表中.这在一开始并不是问题,但在几百万字后,它确实会减慢这个过程.
这就是为什么我认为Hashtable是这项任务的理想选择,因为检查列表理想情况下只有log(1).不幸的是.net中的单个对象只能是2GB.
下一步将实现一个包含2GB哈希表列表的自定义哈希表.
我想知道也许你们中的一些人知道更好的解决方案.(电脑规格极高.)
D.S*_*ley 29
我会跳过数据结构练习并只使用SQL数据库.为什么要编写另一个必须分析和调试的自定义数据结构,只需使用数据库.他们非常擅长回答这样的问题.
Lee*_*Lee 23
我会考虑Trie或Directed非循环字图,它应该比哈希表更节省空间.对字符串成员资格的测试将是O(len),其中len是输入字符串的长度,这可能与字符串散列函数相同.
This can be solved in worst-case O(n) time using radix sort with counting sort as a stable sort for each character position. This is theoretically better than using a hash table (O(n) expected but not guaranteed) or mergesort (O(n log n)). Using a trie would also result in a worst-case O(n)-time solution (constant-time lookup over n keys, since all strings have a bounded length that's a small constant), so this is comparable. I'm not sure how they compare in practice. Radix sort is also fairly easy to implement and there are plenty of existing implementations.
If all strings are d characters or shorter, and the number of distinct characters is k, then radix sort takes O(d (n + k)) time to sort n keys. After sorting, you can traverse the sorted list in O(n) time and increment a counter every time you get to a new string. This would be the number of distinct strings. Since d is ~15 and k is relatively small compared to n (a billion), the running time is not too bad.
This uses O(dn) space though (to hold each string), so it's less space-efficient than tries.