python中对象的高效数据结构,用于基于任何对象成员变量进行查找

kad*_*dam 10 python data-structures

我需要存储具有数字(> 2)整数成员变量的对象,并使用任何成员变量作为搜索键进行快速查找.

为了便于说明,假设对象是3个整数的元组,我需要使用元组的任何元素作为此类元组列表中的键进行快速查找:

collection = [(1, 200, 9), 
              (2, 300, 8), 
              (3, 400, 7)]
Run Code Online (Sandbox Code Playgroud)

查找将是:

collection.lookup_by_first_element(1) # Return (1, 200, 9)
collection.lookup_by_second_element(300) # Return (2, 300, 8)
collection.lookup_by_third_element(250) # Return None
Run Code Online (Sandbox Code Playgroud)

我需要查找快速/高效.到目前为止,我最好的选择是使用内存sqlite数据库,其中三列为三个元组,并在所有三列上放置索引.

搜索树也可以,但是那些有一个查找键,我不知道如何根据多个键进行查找.你会怎么做?

sen*_*rle 10

这是一个简单的解决方案.您可以轻松地将其放入类中并提供更整洁的界面.

>>> from collections import defaultdict
>>> collection = [(1, 200, 9),
...               (2, 300, 8),
...               (3, 400, 7)]
>>> keyed_dict = defaultdict(list)
>>> for tup in collection:
...     for i, e in enumerate(tup):
...         keyed_dict[(i, e)].append(tup)
... 
>>> keyed_dict[(1, 300)]
[(2, 300, 8)]
Run Code Online (Sandbox Code Playgroud)

更新:

对于它的价值,上述方法比查找的numpy解决方案更快:

from timeit import timeit

setup_code = '''
import numpy

clen = {0}  # use .format() to fill this value
collection = [(n, (n + 1) * 100, clen - n) for n in xrange(clen)]

npcollection = numpy.array(collection)

def numpy_lookup(collection, column, value):
    if numpy.any(collection[:, column] == value): return collection[collection[:, column] == value]
    return 'None'

keyed_dict = dict()
for tup in collection:
    for i, e in enumerate(tup):
        keyed_dict[i, e] = tup
'''

for e in range(5):
    setup = setup_code.format(str(10 ** e))
    kd_stmt = '[keyed_dict[0, n] for n in range({0})]'.format(str(10 ** e))
    np_stmt = '[numpy_lookup(npcollection, 0, n) for n in range({0})]'.format(str(10 ** e))
    print 'using keyed_dict: ',
    print timeit(stmt=kd_stmt, setup=setup, number=10)
    print 'using numpy_lookup: ',
    print timeit(stmt=np_stmt.format(str(10 ** e)), setup=setup, number=10)
Run Code Online (Sandbox Code Playgroud)

输出:

using keyed_dict:  1.09672546387e-05
using numpy_lookup:  0.000250101089478
using keyed_dict:  3.00407409668e-05
using numpy_lookup:  0.00193691253662
using keyed_dict:  0.000190019607544
using numpy_lookup:  0.0199580192566
using keyed_dict:  0.00195384025574
using numpy_lookup:  0.317503929138
using keyed_dict:  0.0319399833679
using numpy_lookup:  15.0127439499
Run Code Online (Sandbox Code Playgroud)


dr *_*bob 6

senderle是对的(我赞成),但我想详细说明(而不只是在评论中).

字典查找是O(1)并且非常快(基本上你的密钥变成了一个散列,它解决了内存中的特定位置,可以立即从中检索数据).

相反,通过搜索数组来查找值至少为O(N),因此对于较大的数组,找到正确的值需要更长的时间.(例如,你必须筛选所有N个键,找到正确的键,然后才能返回元组.)

因此,如果您的密钥是独一无二的,那么创建一个基于您可能用于查找的每个密钥编制索引的大型字典是有意义的.当然,你必须代表m个项目的每个元组(在你的情况下m = 3),在字典中m次,而在单个数组中它只需要在数组中表示一次.

所以你想定义一个Collection类

class Collection(object):
    def __init__(self, collection):
        self.collection_dict = dict()
        for tup in collection:
            for i, v in enumerate(tup):
                self.collection_dict[(i, v)] = tup
    def lookup_by_first_element(self, e):
        return self.collection_dict.get((0, e), None)
    def lookup_by_second_element(self, e):
        return self.collection_dict.get((1, e), None)
    def lookup_by_third_element(self, e):
        return self.collection_dict.get((2, e), None)


collection = [(1, 200, 9), 
              (2, 300, 8), 
              (3, 400, 7)]
c = Collection(collection)
Run Code Online (Sandbox Code Playgroud)

内部c.collection_dict是:

{(0, 1): (1, 200, 9),
 (0, 2): (2, 300, 8),
 (0, 3): (3, 400, 7),
 (1, 200): (1, 200, 9),
 (1, 300): (2, 300, 8),
 (1, 400): (3, 400, 7),
 (2, 7): (3, 400, 7),
 (2, 8): (2, 300, 8),
 (2, 9): (1, 200, 9)}
Run Code Online (Sandbox Code Playgroud)

您的查找按您的要求工作

>>> c.lookup_by_first_element(1) == (1, 200, 9)
True
>>> c.lookup_by_second_element(300) == (2, 300, 8)
True
>>> c.lookup_by_third_element(250) is None
True
Run Code Online (Sandbox Code Playgroud)


riz*_*iza -1

使用numpy:

>>> import numpy as np
>>> collection = [(1, 200, 9),
...               (2, 300, 8),
...               (3, 400, 7)]
>>> collection = np.array(collection)
>>> def f(d, c, v):
...     # d: collection, c: column, v: value
...     if np.any(d[:, c]==v): return d[d[:, c]==v]
...     return 'None'
...
>>> f(collection, 0, 1)
array([[  1, 200,   9]])
>>> f(collection, 1, 300)
array([[  2, 300,   8]])
>>> f(collection, 2, 250)
'None'
Run Code Online (Sandbox Code Playgroud)

  • 很抱歉成为批评者,但这比使用字典慢得多。请参阅我更新的答案。 (2认同)