How do I combine hash codes in Python 3?

kev*_*rpe 5 hashcode python-3.x

I'm most familiar with the "Java way" of building a complex/combined hash code from a superclass in a subclass. Is there a better/different/preferred way in Python 3? (I can't find anything specific to Python 3 on Google.)

class Superclass:
    def __init__(self, data):
        self.__data = data

    def __hash__(self):
        return hash(self.__data)

class Subclass(Superclass):
    def __init__(self, data, more_data):
        super().__init__(data)
        self.__more_data = more_data

    def __hash__(self):
        # Just a guess...
        return hash(super()) + 31 * hash(self.__more_data)

To simplify this question, please assume that self.__data and self.__more_data are simple, hashable data, such as str or int.

Mar*_*ers 10

The easiest way to produce good hashes is to put your values in a standard hashable Python container, then hash that. This includes combining hashes in subclasses. I'll explain why, and then how.

Base requirements

First things first:

  • If two objects test as equal, then they MUST have the same hash value
  • Objects that have a hash MUST produce the same hash over time.

Only when you follow those two rules can your objects safely be used in dictionaries and sets. A hash that doesn't change is what keeps dictionaries and sets from breaking: they use the hash to pick a storage location, and if the hash changed they would no longer be able to locate the object given another object that tests equal.
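
As a minimal sketch of what goes wrong otherwise (BadKey is a hypothetical class whose hash tracks a mutable attribute):

class BadKey:
    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return isinstance(other, BadKey) and self.value == other.value

    def __hash__(self):
        # Bad: the hash changes whenever self.value changes
        return hash(self.value)

key = BadKey('spam')
table = {key: 'stored'}
key.value = 'ham'               # mutates the hash of an already-stored key
print(key in table)             # False: the dict probes the wrong slot
print(BadKey('spam') in table)  # False: right slot, but equality now fails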

Note that it doesn’t even matter if the two objects are of different types; True == 1 == 1.0 so all have the same hash and will all count as the same key in a dictionary.
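
You can see this collapse directly in a dictionary literal; all three keys land in a single entry:

>>> {True: 'bool', 1: 'int', 1.0: 'float'}
{True: 'float'}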

What makes a good hash value

You'd want to combine the components of your object value in ways that will produce, as much as possible, different hashes for different values. That includes things like ordering and specific meaning, so that two attributes that represent different aspects of your value, but that can hold the same type of Python objects, still result in different hashes, most of the time.

Note that it's fine if two objects that represent different values (won't test equal) have equal hashes. Reusing a hash value won't break sets or dictionaries. However, if a lot of different object values produce equal hashes then that reduces their efficiency, as you increase the likelihood of collisions. Collisions require collision resolution, and collision resolution takes more time; so much so that denial-of-service attacks have been mounted against servers with predictable hashing implementations (*).

So you want a nice wide spread of possible hash values.

Pitfalls to watch out for

The documentation for the object.__hash__ method includes some advice on how to combine values:

The only required property is that objects which compare equal have the same hash value; it is advised to somehow mix together (e.g. using exclusive or) the hash values for the components of the object that also play a part in comparison of objects.

but only using XOR will not produce good hash values, not when the values whose hashes you XOR together can be of the same type but have different meaning depending on the attribute they've been assigned to. To illustrate with an example:

>>> class Foo:
...     def __init__(self, a, b):
...         self.a = a
...         self.b = b
...     def __hash__(self):
...         return hash(self.a) ^ hash(self.b)
...
>>> hash(Foo(42, 'spam')) == hash(Foo('spam', 42))
True

Because the hashes for self.a and self.b were just XOR-ed together, we got the same hash value for either order, effectively halving the number of usable hashes. Do so with more attributes and you cut the number of unique hashes down rapidly. So you may want to include a bit more information in the hash about each attribute, if the same values can be used in different elements that make up the hash.

Next, know that while Python integers are unbounded, hash values are not. That is to say, hash values have a finite range. From the same documentation:

Note: hash() truncates the value returned from an object’s custom __hash__() method to the size of a Py_ssize_t. This is typically 8 bytes on 64-bit builds and 4 bytes on 32-bit builds.

This means that if you use addition, multiplication or other operations that increase the number of bits needed to store the hash value, you will end up losing the upper bits and so reduce the number of different hash values again.
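
You can observe the truncation by returning an oversized integer from __hash__ (Big is a throwaway class for illustration; the exact behaviour shown is CPython's):

import sys

class Big:
    def __hash__(self):
        return 2 ** 100  # far wider than a Py_ssize_t

print(sys.hash_info.width)            # bits in a hash value, commonly 64
print(hash(Big()) == hash(2 ** 100))  # True on CPython: the oversized
                                      # result is reduced like a large int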

Next, if you XOR together multiple hashes that already have a limited range, chances are you end up with an even smaller number of possible hashes. Try XOR-ing the hashes of 1000 random integers in the range 0-10, for an extreme example:
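
Here's that experiment in code (hash(n) == n for small non-negative integers, so every hash here fits in 4 bits):

import random
from functools import reduce
from operator import xor

# XOR of values that each fit in 4 bits can never leave those 4 bits
combined = reduce(xor, (hash(random.randint(0, 10)) for _ in range(1000)))
print(combined)  # always falls somewhere in range(16)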

Hashing, the easy way

Python developers have long since wrestled with the above pitfalls, and solved them for the standard library types. Use this to your advantage. Put your values in a tuple, then hash that tuple.

Python tuples use a simplified version of the xxHash algorithm to capture order information and to ensure a broad range of hash values. So for different attributes, you can capture the different meanings by giving them different positions in a tuple, then hashing the tuple:

def __hash__(self):
    return hash((self.a, self.b))

This way, distinct orderings produce distinct hash values (barring the rare collision).
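
Revisiting the earlier Foo collision with this approach:

>>> hash((42, 'spam')) == hash(('spam', 42))
False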

If you are subclassing something, put the hash of the parent implementation into one of the tuple positions:

def __hash__(self):
    return hash((super().__hash__(), self.__more_data))

Hashing a hash value does reduce it to a 60-bit or 30-bit value (on 64-bit or 32-bit platforms, respectively), but that's not a big problem when combined with other values in a tuple. If you are really concerned about this, put None in the tuple as a placeholder and XOR the parent hash (so super().__hash__() ^ hash((None, self.__more_data))). But this is overkill, really.
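
If you do want that placeholder variant spelled out as a method, it would read:

def __hash__(self):
    # None marks the parent's tuple position; XOR folds in the
    # full-width parent hash
    return super().__hash__() ^ hash((None, self.__more_data))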

If you have multiple values whose relative order doesn't matter, and you don't want to XOR them all together one by one, consider using a frozenset() object for fast processing, combined with a collections.Counter() object if values are not meant to be unique. The frozenset() hash operation accounts for small hash ranges by reshuffling the bits of the hashes first:

# unordered collection hashing
from collections import Counter
hash(frozenset(Counter(...).items()))
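
For example, a hypothetical TagBag class whose equality ignores order but not multiplicity could hash itself like this:

from collections import Counter

class TagBag:
    def __init__(self, tags):
        self.tags = Counter(tags)  # tag -> count, order-free

    def __eq__(self, other):
        return isinstance(other, TagBag) and self.tags == other.tags

    def __hash__(self):
        # frozenset of (tag, count) pairs: order-independent,
        # multiplicity-aware
        return hash(frozenset(self.tags.items()))

assert hash(TagBag(['a', 'b', 'a'])) == hash(TagBag(['b', 'a', 'a']))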

Consider using dataclasses

For most objects you write __hash__ functions for, you actually want to be using a dataclass-generated class instead:

from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Foo:
    a: Union[int, str]
    b: Union[int, str]

Dataclasses are given a sane __hash__ implementation when frozen=True or unsafe_hash=True, using a tuple() of all the field values.
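
With that frozen dataclass, the XOR collision from earlier disappears (barring a chance collision, the tuple-based hash keeps the two orderings apart):

f1 = Foo(42, 'spam')
f2 = Foo('spam', 42)
print(f1 == f2)              # False: field order is part of equality
print(hash(f1) == hash(f2))  # False: the fields hash as an ordered tuple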


(*) Python protects your code against such hash collision attacks by using a process-wide random hash seed to hash strings, bytes and datetime objects.


Tho*_*ers 5

The Python documentation suggests that you combine hashes using XOR:

The only required property is that objects which compare equal have the same hash value; it is advised to somehow mix together (e.g. using exclusive or) the hash values for the components of the object that also play a part in comparison of objects.

I would also recommend XOR over addition and multiplication, because of the following:

Note: hash() truncates the value returned from an object's custom __hash__() method to the size of a Py_ssize_t. This is typically 8 bytes on 64-bit builds and 4 bytes on 32-bit builds. If an object's __hash__() must interoperate on builds of different bit sizes, be sure to check the width on all supported builds. An easy way to do this is with python -c "import sys; print(sys.hash_info.width)".

By the way, this documentation is the same for Python 2.7 and Python 3.4.
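
A minimal sketch of the XOR approach this answer recommends (Point is a hypothetical example class; see the accepted answer above for why XOR alone collides on symmetric values):

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # XOR the component hashes, per the documentation's advice
        return hash(self.x) ^ hash(self.y)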