Tre*_*ent 10 python indexing search dictionary associative-array
我想知道如何从python字典中对键执行某种索引.这本字典大约有.400,000项,所以我试图避免线性搜索.
基本上,我试图找到是否userinput在任何dict键内.
for keys in dict:
if userinput in keys:
DoSomething()
break
Run Code Online (Sandbox Code Playgroud)
这将是我想要做的一个例子.有没有办法以更直接的方式搜索,没有循环?或者什么是更有效的方式.
澄清:该userinput是不完全的关键是什么,例如userinput可能log的,而关键是logfile
编辑:可以在搜索之前完成的任何列表/缓存创建,预处理或组织都是可接受的.唯一需要快速的是搜索密钥.
不。在字典键中搜索字符串的唯一方法是查找每个键。像你所建议的那样是使用字典来做到这一点的唯一方法。
但是,如果您有 400,000 条记录并且想要加快搜索速度,我建议使用 SQLite 数据库。那你就可以说SELECT * FROM TABLE_NAME WHERE COLUMN_NAME LIKE '%userinput%';。请在此处查看 Python 的 sqlite3 模块的文档。
另一种选择是使用生成器表达式,因为它们几乎总是比等效的 for 循环更快。
filteredKeys = (key for key in myDict.keys() if userInput in key)
for key in filteredKeys:
doSomething()
Run Code Online (Sandbox Code Playgroud)
编辑:如果正如您所说,您不关心一次性成本,请使用数据库。SQLite 应该可以完美地完成你想要的事情。
我做了一些基准测试,令我惊讶的是,这个简单的算法实际上比使用列表推导式的版本快两倍,比 SQLite 驱动的版本快六倍。鉴于这些结果,我必须选择 @Mark Byers 并推荐一个 Trie。我在下面发布了基准,以防有人想尝试一下。
import random, string, os
import time
import sqlite3
def buildDict(numElements):
aDict = {}
for i in xrange(numElements-10):
aDict[''.join(random.sample(string.letters, 6))] = 0
for i in xrange(10):
aDict['log'+''.join(random.sample(string.letters, 3))] = 0
return aDict
def naiveLCSearch(aDict, searchString):
filteredKeys = [key for key in aDict.keys() if searchString in key]
return filteredKeys
def naiveSearch(aDict, searchString):
filteredKeys = []
for key in aDict:
if searchString in key:
filteredKeys.append(key)
return filteredKeys
def insertIntoDB(aDict):
conn = sqlite3.connect('/tmp/dictdb')
c = conn.cursor()
c.execute('DROP TABLE IF EXISTS BLAH')
c.execute('CREATE TABLE BLAH (KEY TEXT PRIMARY KEY, VALUE TEXT)')
for key in aDict:
c.execute('INSERT INTO BLAH VALUES(?,?)',(key, aDict[key]))
return conn
def dbSearch(conn):
cursor = conn.cursor()
cursor.execute("SELECT KEY FROM BLAH WHERE KEY GLOB '*log*'")
return [record[0] for record in cursor]
if __name__ == '__main__':
aDict = buildDict(400000)
conn = insertIntoDB(aDict)
startTimeNaive = time.time()
for i in xrange(3):
naiveResults = naiveSearch(aDict, 'log')
endTimeNaive = time.time()
print 'Time taken for 3 iterations of naive search was', (endTimeNaive-startTimeNaive), 'and the average time per run was', (endTimeNaive-startTimeNaive)/3.0
startTimeNaiveLC = time.time()
for i in xrange(3):
naiveLCResults = naiveLCSearch(aDict, 'log')
endTimeNaiveLC = time.time()
print 'Time taken for 3 iterations of naive search with list comprehensions was', (endTimeNaiveLC-startTimeNaiveLC), 'and the average time per run was', (endTimeNaiveLC-startTimeNaiveLC)/3.0
startTimeDB = time.time()
for i in xrange(3):
dbResults = dbSearch(conn)
endTimeDB = time.time()
print 'Time taken for 3 iterations of DB search was', (endTimeDB-startTimeDB), 'and the average time per run was', (endTimeDB-startTimeDB)/3.0
os.remove('/tmp/dictdb')
Run Code Online (Sandbox Code Playgroud)
根据记录,我的结果是:
Time taken for 3 iterations of naive search was 0.264658927917 and the average time per run was 0.0882196426392
Time taken for 3 iterations of naive search with list comprehensions was 0.403481960297 and the average time per run was 0.134493986766
Time taken for 3 iterations of DB search was 1.19464492798 and the average time per run was 0.398214975993
Run Code Online (Sandbox Code Playgroud)
所有时间均以秒为单位。