搜索词典键python

Tre*_*ent 10 python indexing search dictionary associative-array

我想知道如何从python字典中对键执行某种索引.这本字典大约有.400,000项,所以我试图避免线性搜索.

基本上,我试图找到是否userinput在任何dict键内.

for keys in dict:
    if userinput in keys:
        DoSomething()
        break
Run Code Online (Sandbox Code Playgroud)

这将是我想要做的一个例子.有没有办法以更直接的方式搜索,没有循环?或者什么是更有效的方式.

澄清:userinput是不完全的关键是什么,例如userinput可能log的,而关键是logfile

编辑:可以在搜索之前完成的任何列表/缓存创建,预处理或组织都是可接受的.唯一需要快速的是搜索密钥.

Mar*_*ers 6

如果您只需要找到以前缀开头的键,那么您可以使用trie.存在更复杂的数据结构,用于查找在其中任何位置包含子字符串的键,但是它们占用了更多的空间来存储,因此它是一个时空权衡.


Chi*_*chi 3

不。在字典键中搜索字符串的唯一方法是查找每个键。像你所建议的那样是使用字典来做到这一点的唯一方法。

但是,如果您有 400,000 条记录并且想要加快搜索速度,我建议使用 SQLite 数据库。那你就可以说SELECT * FROM TABLE_NAME WHERE COLUMN_NAME LIKE '%userinput%';请在此处查看 Python 的 sqlite3 模块的文档。

另一种选择是使用生成器表达式,因为它们几乎总是比等效的 for 循环更快。

filteredKeys = (key for key in myDict.keys() if userInput in key)
for key in filteredKeys:
    doSomething()
Run Code Online (Sandbox Code Playgroud)

编辑:如果正如您所说,您不关心一次性成本,请使用数据库。SQLite 应该可以完美地完成你想要的事情。

我做了一些基准测试,令我惊讶的是,这个简单的算法实际上比使用列表推导式的版本快两倍,比 SQLite 驱动的版本快六倍鉴于这些结果,我必须选择 @Mark Byers 并推荐一个 Trie。我在下面发布了基准,以防有人想尝试一下。

import random, string, os
import time
import sqlite3

def buildDict(numElements):
    aDict = {}
    for i in xrange(numElements-10):
        aDict[''.join(random.sample(string.letters, 6))] = 0

    for i in xrange(10):
        aDict['log'+''.join(random.sample(string.letters, 3))] = 0

    return aDict

def naiveLCSearch(aDict, searchString):
    filteredKeys = [key for key in aDict.keys() if searchString in key]
    return filteredKeys

def naiveSearch(aDict, searchString):
    filteredKeys = []
    for key in aDict:
        if searchString in key: 
            filteredKeys.append(key)
    return filteredKeys

def insertIntoDB(aDict):
    conn = sqlite3.connect('/tmp/dictdb')
    c = conn.cursor()
    c.execute('DROP TABLE IF EXISTS BLAH')
    c.execute('CREATE TABLE BLAH (KEY TEXT PRIMARY KEY, VALUE TEXT)')
    for key in aDict:
        c.execute('INSERT INTO BLAH VALUES(?,?)',(key, aDict[key]))
    return conn

def dbSearch(conn):
    cursor = conn.cursor()
    cursor.execute("SELECT KEY FROM BLAH WHERE KEY GLOB '*log*'")
    return [record[0] for record in cursor]

if __name__ == '__main__':
    aDict = buildDict(400000)
    conn = insertIntoDB(aDict)
    startTimeNaive = time.time()
    for i in xrange(3):
        naiveResults = naiveSearch(aDict, 'log')
    endTimeNaive = time.time()
    print 'Time taken for 3 iterations of naive search was', (endTimeNaive-startTimeNaive), 'and the average time per run was', (endTimeNaive-startTimeNaive)/3.0

    startTimeNaiveLC = time.time()
    for i in xrange(3):
        naiveLCResults = naiveLCSearch(aDict, 'log')
    endTimeNaiveLC = time.time()
    print 'Time taken for 3 iterations of naive search with list comprehensions was', (endTimeNaiveLC-startTimeNaiveLC), 'and the average time per run was', (endTimeNaiveLC-startTimeNaiveLC)/3.0

    startTimeDB = time.time()
    for i in xrange(3):
        dbResults = dbSearch(conn)
    endTimeDB = time.time()
    print 'Time taken for 3 iterations of DB search was', (endTimeDB-startTimeDB), 'and the average time per run was', (endTimeDB-startTimeDB)/3.0


    os.remove('/tmp/dictdb')
Run Code Online (Sandbox Code Playgroud)

根据记录,我的结果是:

Time taken for 3 iterations of naive search was 0.264658927917 and the average time per run was 0.0882196426392
Time taken for 3 iterations of naive search with list comprehensions was 0.403481960297 and the average time per run was 0.134493986766
Time taken for 3 iterations of DB search was 1.19464492798 and the average time per run was 0.398214975993
Run Code Online (Sandbox Code Playgroud)

所有时间均以秒为单位。

  • 我会将上面的内容缩短为“但是,如果您有 400,000 条记录,请使用数据库。” (4认同)