在Python中散列不可变字典

Question

在Python中散列不可变字典

wks*_*rtz 15 python hash dictionary immutability python-3.2

简短版本:作为无序项目字典实现的多集合的最佳散列算法是什么？

我正在尝试将一个不可变的multiset(这是一个包或其他语言的多重集合:像一个数学集,除了它可以容纳多个元素)作为字典实现.我已经创建了标准库类的子类collections.Counter,类似于这里的建议:Python hashable dicts,它建议像这样的哈希函数:

class FrozenCounter(collections.Counter):
    # ...
    def __hash__(self):
        return hash(tuple(sorted(self.items())))

Run Code Online (Sandbox Code Playgroud)

创建完整的项目元组会占用大量内存(相对于使用生成器而言),并且哈希将在我的应用程序的内存密集型部分中发生.更重要的是,我的字典键(multiset元素)可能不会是可订购的.

我正在考虑使用这个算法:

def __hash__(self):
    return functools.reduce(lambda a, b: a ^ b, self.items(), 0)

Run Code Online (Sandbox Code Playgroud)

我想使用按位XOR意味着顺序与散列值无关,与元组的散列不同？我想我可以在我的数据的无序流序列上半实现Python元组散列算法.请参阅https://github.com/jonashaag/cpython/blob/master/Include/tupleobject.h(在页面中搜索"hash"一词) - 但我几乎不知道有足够的C来阅读它.

思考？建议？谢谢.

(如果你想知道为什么我要乱用多线程:我的问题的输入数据是多集的集合,并且在每组多集中,每个多集必须是唯一的.我正在截止日期工作而且我不是一个经验丰富的编码员,所以我想避免在可能的情况下发明新的算法.似乎最恐怖的方式来确保我有一堆东西的独特之处就是把它们放进去set(),但事情必须是哈希的.)

我从评论中收集到了什么

@marcin和@senderle都给出了相同的答案:使用hash(frozenset(self.items())).这是有道理的,因为items()"视图"是设置的.@marcin是第一个,但我给@senderle打了一个复选标记,因为对不同解决方案的大O运行时间进行了很好的研究.@marcin还提醒我要包含一个__eq__方法 - 但是继承自的方法dict会很好用.这就是我实现所有内容的方式 - 欢迎基于此代码的进一步意见和建议:

class FrozenCounter(collections.Counter):
    # Edit: A previous version of this code included a __slots__ definition.
    # But, from the Python documentation: "When inheriting from a class without
    # __slots__, the __dict__ attribute of that class will always be accessible,
    # so a __slots__ definition in the subclass is meaningless."
    # http://docs.python.org/py3k/reference/datamodel.html#notes-on-using-slots
    # ...
    def __hash__(self):
        "Implements hash(self) -> int"
        if not hasattr(self, '_hash'):
            self._hash = hash(frozenset(self.items()))
        return self._hash

Run Code Online (Sandbox Code Playgroud)

Answer 1

sen*_*rle 13

由于字典是不可变的,因此您可以在创建字典时创建哈希并直接返回.我的建议是创建一个frozensetfrom items(在3+; iteritems在2.7中),哈希它,并存储哈希.

提供一个明确的例子:

>>>> frozenset(Counter([1, 1, 1, 2, 3, 3, 4]).iteritems())
frozenset([(3, 2), (1, 3), (4, 1), (2, 1)])
>>>> hash(frozenset(Counter([1, 1, 1, 2, 3, 3, 4]).iteritems()))
-3071743570178645657
>>>> hash(frozenset(Counter([1, 1, 1, 2, 3, 4]).iteritems()))
-6559486438209652990

Run Code Online (Sandbox Code Playgroud)

为了澄清为什么我更喜欢一个frozenset排序项的元组:a frozenset不必对项进行排序(因为它们通过内存中的哈希值稳定排序),因此初始哈希应该在O(n)时间内完成而不是O(n log n)时间.这可以从frozenset_hash和set_next实现中看出.

归档时间：	13 年，6 月前
查看次数：	1958 次
最近记录：	6 年，2 月前