作为一个练习,主要是为了我自己的娱乐,我正在实现一个回溯包装解析器.对此的灵感是我想更好地了解hygenic宏如何在类似algol的语言中工作(与你通常在其中找到的语法免费lisp方言相对应).因此,通过输入的不同传递可能会看到不同的语法,因此缓存的解析结果无效,除非我还存储语法的当前版本以及缓存的解析结果.(编辑:使用键值集合的结果是它们应该是不可变的,但我不打算公开接口以允许它们被更改,因此可变或不可变集合都可以)
问题是python dicts不能作为其他dicts的键.即使使用元组(正如我将要做的那样)也无济于事.
>>> cache = {}
>>> rule = {"foo":"bar"}
>>> cache[(rule, "baz")] = "quux"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'
>>>
Run Code Online (Sandbox Code Playgroud)
我想它必须一直是元组.现在python标准库提供了我所需要的,collections.namedtuple具有非常不同的语法,但可以用作键.继续上述会议:
>>> from collections import namedtuple
>>> Rule = namedtuple("Rule",rule.keys())
>>> cache[(Rule(**rule), "baz")] = "quux"
>>> cache
{(Rule(foo='bar'), 'baz'): 'quux'}
Run Code Online (Sandbox Code Playgroud)
好.但是我必须为我想要使用的规则中的每个可能的键组合创建一个类,这不是那么糟糕,因为每个解析规则确切地知道它使用了什么参数,因此可以同时定义该类作为解析规则的函数.
编辑:namedtuples 的另一个问题是它们是严格定位的.两个看起来应该不同的元组实际上可以是相同的:
>>> you = namedtuple("foo",["bar","baz"])
>>> me = namedtuple("foo",["bar","quux"])
>>> you(bar=1,baz=2) == me(bar=1,quux=2)
True
>>> bob = namedtuple("foo",["baz","bar"])
>>> you(bar=1,baz=2) == bob(bar=1,baz=2)
False
Run Code Online (Sandbox Code Playgroud)
tl'dr:我如何获得dict可用作其他dicts的密钥的s?
对答案进行了一些攻击,这是我正在使用的更完整的解决方案.请注意,这需要做一些额外的工作,以使得生成的dicts在实际应用中模糊不变.当然,通过电话来解决它仍然很容易,dict.__setitem__(instance, key, value)但我们都是成年人.
class hashdict(dict):
"""
hashable dict implementation, suitable for use as a key into
other dicts.
>>> h1 = hashdict({"apples": 1, "bananas":2})
>>> h2 = hashdict({"bananas": 3, "mangoes": 5})
>>> h1+h2
hashdict(apples=1, bananas=3, mangoes=5)
>>> d1 = {}
>>> d1[h1] = "salad"
>>> d1[h1]
'salad'
>>> d1[h2]
Traceback (most recent call last):
...
KeyError: hashdict(bananas=3, mangoes=5)
based on answers from
http://stackoverflow.com/questions/1151658/python-hashable-dicts
"""
def __key(self):
return tuple(sorted(self.items()))
def __repr__(self):
return "{0}({1})".format(self.__class__.__name__,
", ".join("{0}={1}".format(
str(i[0]),repr(i[1])) for i in self.__key()))
def __hash__(self):
return hash(self.__key())
def __setitem__(self, key, value):
raise TypeError("{0} does not support item assignment"
.format(self.__class__.__name__))
def __delitem__(self, key):
raise TypeError("{0} does not support item assignment"
.format(self.__class__.__name__))
def clear(self):
raise TypeError("{0} does not support item assignment"
.format(self.__class__.__name__))
def pop(self, *args, **kwargs):
raise TypeError("{0} does not support item assignment"
.format(self.__class__.__name__))
def popitem(self, *args, **kwargs):
raise TypeError("{0} does not support item assignment"
.format(self.__class__.__name__))
def setdefault(self, *args, **kwargs):
raise TypeError("{0} does not support item assignment"
.format(self.__class__.__name__))
def update(self, *args, **kwargs):
raise TypeError("{0} does not support item assignment"
.format(self.__class__.__name__))
# update is not ok because it mutates the object
# __add__ is ok because it creates a new object
# while the new object is under construction, it's ok to mutate it
def __add__(self, right):
result = hashdict(self)
dict.update(result, right)
return result
if __name__ == "__main__":
import doctest
doctest.testmod()
Run Code Online (Sandbox Code Playgroud)
Unk*_*own 60
这是制作可编辑字典的简便方法.请记住,出于显而易见的原因,在嵌入另一个字典后不要改变它们.
class hashabledict(dict):
def __hash__(self):
return hash(tuple(sorted(self.items())))
Run Code Online (Sandbox Code Playgroud)
Ale*_*lli 58
Hashables应该是不可变的 - 不强制执行此操作但是在第一次使用dict作为键后不要改变dict,以下方法可行:
class hashabledict(dict):
def __key(self):
return tuple((k,self[k]) for k in sorted(self))
def __hash__(self):
return hash(self.__key())
def __eq__(self, other):
return self.__key() == other.__key()
Run Code Online (Sandbox Code Playgroud)
如果你需要改变你的dicts并且STILL想要将它们用作键,复杂性会爆发百倍 - 而不是说它无法完成,但我会等到一个非常具体的指示才能进入那令人难以置信的泥潭! - )
Ray*_*ger 32
使字典可用于您的目的所需的只是添加__hash__方法:
class Hashabledict(dict):
def __hash__(self):
return hash(frozenset(self))
Run Code Online (Sandbox Code Playgroud)
请注意,冻结集转换适用于所有字典(即,它不需要键可以排序).同样,对字典值没有限制.
如果有许多字典具有相同的键但具有不同的值,则必须将哈希值考虑在内.最快的方法是:
class Hashabledict(dict):
def __hash__(self):
return hash((frozenset(self), frozenset(self.itervalues())))
Run Code Online (Sandbox Code Playgroud)
这比frozenset(self.iteritems())两个原因更快.首先,该frozenset(self)步骤重用存储在字典中的哈希值,从而节省了不必要的调用hash(key).其次,使用itervalues将直接访问这些值,并避免每次使用项时使用多个内存分配器调用,以便在每次执行查找时在内存中形成新的多个键/值元组.
Obe*_*nne 23
给定的答案是可以的,但可以通过使用frozenset(...)而不是tuple(sorted(...))生成哈希来改进它们:
>>> import timeit
>>> timeit.timeit('hash(tuple(sorted(d.iteritems())))', "d = dict(a=3, b='4', c=2345, asdfsdkjfew=0.23424, x='sadfsadfadfsaf')")
4.7758948802947998
>>> timeit.timeit('hash(frozenset(d.iteritems()))', "d = dict(a=3, b='4', c=2345, asdfsdkjfew=0.23424, x='sadfsadfadfsaf')")
1.8153600692749023
Run Code Online (Sandbox Code Playgroud)
性能优势取决于字典的内容,但在大多数情况下我测试过,散列的frozenset速度至少快2倍(主要是因为它不需要排序).
Mik*_*ham 11
一个相当干净,直接的实现是
import collections
class FrozenDict(collections.Mapping):
"""Don't forget the docstrings!!"""
def __init__(self, *args, **kwargs):
self._d = dict(*args, **kwargs)
def __iter__(self):
return iter(self._d)
def __len__(self):
return len(self._d)
def __getitem__(self, key):
return self._d[key]
def __hash__(self):
return hash(tuple(sorted(self._d.iteritems())))
Run Code Online (Sandbox Code Playgroud)
我一直回到这个话题......这是另一种变化.我对子类化dict添加__hash__方法感到不安; 几乎没有逃避dict是可变的问题,并且相信它们不会改变似乎是一个弱想法.所以我反而考虑构建一个基于内置类型的映射,它本身是不可变的.虽然tuple是一个明显的选择,但访问它中的值意味着一种排序和一个二等分; 这不是一个问题,但似乎并没有充分利用它所构建的类型的大部分功能.
如果将密钥,值对保密到一个frozenset怎么办?那将需要什么,它将如何工作?
第1部分,您需要一种对'item'进行编码的方式,即冻结集将主要通过其键处理它们; 我会为此做一个小子类.
import collections
class pair(collections.namedtuple('pair_base', 'key value')):
def __hash__(self):
return hash((self.key, None))
def __eq__(self, other):
if type(self) != type(other):
return NotImplemented
return self.key == other.key
def __repr__(self):
return repr((self.key, self.value))
Run Code Online (Sandbox Code Playgroud)
仅这一点就会让你在不可变映射的吐痰距离中:
>>> frozenset(pair(k, v) for k, v in enumerate('abcd'))
frozenset([(0, 'a'), (2, 'c'), (1, 'b'), (3, 'd')])
>>> pairs = frozenset(pair(k, v) for k, v in enumerate('abcd'))
>>> pair(2, None) in pairs
True
>>> pair(5, None) in pairs
False
>>> goal = frozenset((pair(2, None),))
>>> pairs & goal
frozenset([(2, None)])
Run Code Online (Sandbox Code Playgroud)
D'哦!不幸的是,当你使用set运算符并且元素相等但不是同一个对象时; 哪一个最终返回值是未定义的,我们将不得不经历更多的回转.
>>> pairs - (pairs - goal)
frozenset([(2, 'c')])
>>> iter(pairs - (pairs - goal)).next().value
'c'
Run Code Online (Sandbox Code Playgroud)
然而,以这种方式查看值很麻烦,更糟糕的是,创建了许多中间集; 那不行!我们将创建一个'假的'键值对来绕过它:
class Thief(object):
def __init__(self, key):
self.key = key
def __hash__(self):
return hash(pair(self.key, None))
def __eq__(self, other):
self.value = other.value
return pair(self.key, None) == other
Run Code Online (Sandbox Code Playgroud)
这导致问题较少:
>>> thief = Thief(2)
>>> thief in pairs
True
>>> thief.value
'c'
Run Code Online (Sandbox Code Playgroud)
这就是所有深刻的魔力; 剩下的就是将它全部包装成具有类似dict 的界面的东西.由于我们是子类化frozenset,具有非常不同的接口,因此有很多方法; 我们从中得到了一些帮助collections.Mapping,但大多数工作都覆盖了frozenset像dicts这样的版本的方法,而是:
class FrozenDict(frozenset, collections.Mapping):
def __new__(cls, seq=()):
return frozenset.__new__(cls, (pair(k, v) for k, v in seq))
def __getitem__(self, key):
thief = Thief(key)
if frozenset.__contains__(self, thief):
return thief.value
raise KeyError(key)
def __eq__(self, other):
if not isinstance(other, FrozenDict):
return dict(self.iteritems()) == other
if len(self) != len(other):
return False
for key, value in self.iteritems():
try:
if value != other[key]:
return False
except KeyError:
return False
return True
def __hash__(self):
return hash(frozenset(self.iteritems()))
def get(self, key, default=None):
thief = Thief(key)
if frozenset.__contains__(self, thief):
return thief.value
return default
def __iter__(self):
for item in frozenset.__iter__(self):
yield item.key
def iteritems(self):
for item in frozenset.__iter__(self):
yield (item.key, item.value)
def iterkeys(self):
for item in frozenset.__iter__(self):
yield item.key
def itervalues(self):
for item in frozenset.__iter__(self):
yield item.value
def __contains__(self, key):
return frozenset.__contains__(self, pair(key, None))
has_key = __contains__
def __repr__(self):
return type(self).__name__ + (', '.join(repr(item) for item in self.iteritems())).join('()')
@classmethod
def fromkeys(cls, keys, value=None):
return cls((key, value) for key in keys)
Run Code Online (Sandbox Code Playgroud)
最终,它确实回答了我自己的问题:
>>> myDict = {}
>>> myDict[FrozenDict(enumerate('ab'))] = 5
>>> FrozenDict(enumerate('ab')) in myDict
True
>>> FrozenDict(enumerate('bc')) in myDict
False
>>> FrozenDict(enumerate('ab', 3)) in myDict
False
>>> myDict[FrozenDict(enumerate('ab'))]
5
Run Code Online (Sandbox Code Playgroud)
@Unknown 接受的答案以及@AlexMartelli 的答案都可以很好地工作,但仅限于以下限制条件:
hash(hashabledict({'a':[1,2]}))将提高TypeError.hash(hashabledict({'a':'a', 1:1}))将提高TypeError.frozenset((1,2,3))and frozenset((4,5,6)),则它们在两个方向上比较不相等。因此,用这样的键对字典的项目进行排序可能会导致任意顺序,因此将违反相等对象必须具有相同哈希值的规则。@ObenSonne 提供的更快的答案解除了约束 2 和 3,但仍受约束 1 的约束(值必须是可散列的)。
@RaymondHettinger 的更快但答案解除了所有 3 个约束,因为它不包括.values()在哈希计算中。但是,只有在以下情况下其性能良好:
.keys()。如果不满足这个条件,散列函数仍然有效,但可能会导致过多的冲突。例如,在所有字典都是从网站模板生成的极端情况下(字段名称作为键,用户输入作为值),键将始终相同,并且哈希函数将为所有输入返回相同的值. 结果,依赖于这种哈希函数的哈希表在检索项目时将变得和列表一样慢(O(N)而不是O(1))。
我认为即使违反了我上面列出的所有 4 个约束,以下解决方案也能很好地工作。它还有一个额外的优势,它不仅可以对字典进行哈希处理,还可以对任何容器进行哈希处理,即使它们具有嵌套的可变容器。
我非常感谢对此的任何反馈,因为到目前为止我只对此进行了轻微的测试。
# python 3.4
import collections
import operator
import sys
import itertools
import reprlib
# a wrapper to make an object hashable, while preserving equality
class AutoHash:
# for each known container type, we can optionally provide a tuple
# specifying: type, transform, aggregator
# even immutable types need to be included, since their items
# may make them unhashable
# transformation may be used to enforce the desired iteration
# the result of a transformation must be an iterable
# default: no change; for dictionaries, we use .items() to see values
# usually transformation choice only affects efficiency, not correctness
# aggregator is the function that combines all items into one object
# default: frozenset; for ordered containers, we can use tuple
# aggregator choice affects both efficiency and correctness
# e.g., using a tuple aggregator for a set is incorrect,
# since identical sets may end up with different hash values
# frozenset is safe since at worst it just causes more collisions
# unfortunately, no collections.ABC class is available that helps
# distinguish ordered from unordered containers
# so we need to just list them out manually as needed
type_info = collections.namedtuple(
'type_info',
'type transformation aggregator')
ident = lambda x: x
# order matters; first match is used to handle a datatype
known_types = (
# dict also handles defaultdict
type_info(dict, lambda d: d.items(), frozenset),
# no need to include set and frozenset, since they are fine with defaults
type_info(collections.OrderedDict, ident, tuple),
type_info(list, ident, tuple),
type_info(tuple, ident, tuple),
type_info(collections.deque, ident, tuple),
type_info(collections.Iterable, ident, frozenset) # other iterables
)
# hash_func can be set to replace the built-in hash function
# cache can be turned on; if it is, cycles will be detected,
# otherwise cycles in a data structure will cause failure
def __init__(self, data, hash_func=hash, cache=False, verbose=False):
self._data=data
self.hash_func=hash_func
self.verbose=verbose
self.cache=cache
# cache objects' hashes for performance and to deal with cycles
if self.cache:
self.seen={}
def hash_ex(self, o):
# note: isinstance(o, Hashable) won't check inner types
try:
if self.verbose:
print(type(o),
reprlib.repr(o),
self.hash_func(o),
file=sys.stderr)
return self.hash_func(o)
except TypeError:
pass
# we let built-in hash decide if the hash value is worth caching
# so we don't cache the built-in hash results
if self.cache and id(o) in self.seen:
return self.seen[id(o)][0] # found in cache
# check if o can be handled by decomposing it into components
for typ, transformation, aggregator in AutoHash.known_types:
if isinstance(o, typ):
# another option is:
# result = reduce(operator.xor, map(_hash_ex, handler(o)))
# but collisions are more likely with xor than with frozenset
# e.g. hash_ex([1,2,3,4])==0 with xor
try:
# try to frozenset the actual components, it's faster
h = self.hash_func(aggregator(transformation(o)))
except TypeError:
# components not hashable with built-in;
# apply our extended hash function to them
h = self.hash_func(aggregator(map(self.hash_ex, transformation(o))))
if self.cache:
# storing the object too, otherwise memory location will be reused
self.seen[id(o)] = (h, o)
if self.verbose:
print(type(o), reprlib.repr(o), h, file=sys.stderr)
return h
raise TypeError('Object {} of type {} not hashable'.format(repr(o), type(o)))
def __hash__(self):
return self.hash_ex(self._data)
def __eq__(self, other):
# short circuit to save time
if self is other:
return True
# 1) type(self) a proper subclass of type(other) => self.__eq__ will be called first
# 2) any other situation => lhs.__eq__ will be called first
# case 1. one side is a subclass of the other, and AutoHash.__eq__ is not overridden in either
# => the subclass instance's __eq__ is called first, and we should compare self._data and other._data
# case 2. neither side is a subclass of the other; self is lhs
# => we can't compare to another type; we should let the other side decide what to do, return NotImplemented
# case 3. neither side is a subclass of the other; self is rhs
# => we can't compare to another type, and the other side already tried and failed;
# we should return False, but NotImplemented will have the same effect
# any other case: we won't reach the __eq__ code in this class, no need to worry about it
if isinstance(self, type(other)): # identifies case 1
return self._data == other._data
else: # identifies cases 2 and 3
return NotImplemented
d1 = {'a':[1,2], 2:{3:4}}
print(hash(AutoHash(d1, cache=True, verbose=True)))
d = AutoHash(dict(a=1, b=2, c=3, d=[4,5,6,7], e='a string of chars'),cache=True, verbose=True)
print(hash(d))
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
49537 次 |
| 最近记录: |