Storing a huge hash table in a file in Python

oob*_*boo 4 python hashtable file

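Hey. I have a function I want to memoize, but it has too many possible values. Is there a convenient way to store the values in a text file and read them back from it? For example, storing a precomputed list of primes up to 10^9 in a text file? I know that reading from a text file is slow, but if the amount of data is really huge there is no other option. Thanks!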

Ale*_*lli 11

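For a list of primes up to 10**9, why do you need a hash? What would the KEYS be?! This sounds like a perfect opportunity for a simple, straightforward binary file! By the prime number theorem, there are about 10**9/ln(10**9) such primes, i.e. roughly 50 million (the exact count is 50,847,534). At 4 bytes per prime, that's only about 200 MB, a perfect fit for an array.array("L") and its methods such as fromfile (see the docs). Check the array's itemsize, though: "L" is 8 bytes on many 64-bit platforms, where "I" is the 4-byte typecode. In many cases you could actually suck all 200 MB into memory, but in the worst case you can pull in just a slice of it (e.g. via mmap and the fromstring method of array.array, named frombytes in Python 3), do binary searches there (e.g. via bisect), and so on.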
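
To make the binary-file idea concrete, here is a minimal sketch, not a tuned implementation. The helper names (`sieve`, `save_primes`, `load_primes`, `is_prime`) and the file name are made up for illustration, and it assumes the "I" typecode is 4 bytes on your platform. It uses a small limit of 100 so it runs instantly; for 10**9 you would want a segmented sieve to generate the data.

```python
import bisect
import os
import tempfile
from array import array


def sieve(limit):
    """Return a list of all primes <= limit (plain Sieve of Eratosthenes)."""
    if limit < 2:
        return []
    flags = bytearray([1]) * (limit + 1)
    flags[0] = flags[1] = 0
    for n in range(2, int(limit ** 0.5) + 1):
        if flags[n]:
            # Zero out every multiple of n starting at n*n.
            flags[n * n :: n] = bytearray(len(range(n * n, limit + 1, n)))
    return [n for n, is_p in enumerate(flags) if is_p]


def save_primes(path, primes, typecode="I"):
    """Write the primes as raw machine integers (assumed 4 bytes each for "I")."""
    with open(path, "wb") as f:
        array(typecode, primes).tofile(f)


def load_primes(path, typecode="I"):
    """Read the whole file back into an array.array."""
    arr = array(typecode)
    count = os.path.getsize(path) // arr.itemsize
    with open(path, "rb") as f:
        arr.fromfile(f, count)
    return arr


def is_prime(primes, n):
    """Binary-search the sorted array for n."""
    i = bisect.bisect_left(primes, n)
    return i < len(primes) and primes[i] == n


# Hypothetical demo file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "primes.bin")
save_primes(path, sieve(100))
primes = load_primes(path)
```

After this, `is_prime(primes, 97)` is a single binary search over the loaded array, and the file on disk is exactly `len(primes) * primes.itemsize` bytes.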

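When you do need a huge key-value store (gigabytes, not a paltry 200 MB! :-), I used to recommend shelve, but after unpleasant real-life experience with huge shelves (performance, reliability, etc.), I currently recommend a database engine instead: sqlite is good and comes with Python, PostgreSQL is even better, non-relational ones such as CouchDB can be better still, and so forth.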
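
As a rough illustration of the sqlite route, here is a minimal key-value sketch built on the standard `sqlite3` module. The class name `SqliteCache`, the table name `cache`, and the choice to pickle values are all assumptions made for this example, not anything sqlite prescribes. It uses an in-memory database so it runs anywhere; pass a file path instead to persist the data.

```python
import pickle
import sqlite3


class SqliteCache:
    """A tiny key-value store backed by a single SQLite table."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value BLOB)"
        )

    def get(self, key, default=None):
        """Return the stored value for key, or default if it is absent."""
        row = self.conn.execute(
            "SELECT value FROM cache WHERE key = ?", (key,)
        ).fetchone()
        return default if row is None else pickle.loads(row[0])

    def set(self, key, value):
        """Insert or overwrite the value stored at key."""
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)",
                (key, pickle.dumps(value)),
            )

    def close(self):
        self.conn.close()


cache = SqliteCache(":memory:")  # use a file name here to keep data on disk
cache.set("fib(30)", 832040)
```

The primary key gives indexed lookups, so `cache.get("fib(30)")` does not scan the table. Only unpickle data you wrote yourself, since `pickle.loads` is unsafe on untrusted input.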


小智 6

You can use the shelve module to store dictionary-like structures in a file. From the Python docs:

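import shelve

# placeholder values (not part of the docs excerpt) so the snippet runs as-is
filename = 'example_shelf'
key = 'some_key'
data = {'any': 'picklable object'}

d = shelve.open(filename) # open -- file may get suffix added by low-level
                          # library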

d[key] = data   # store data at key (overwrites old data if
                # using an existing key)
data = d[key]   # retrieve a COPY of data at key (raise KeyError if no
                # such key)
del d[key]      # delete data stored at key (raises KeyError
                # if no such key)
flag = key in d         # true if the key exists (has_key() is Python 2 only)
klist = list(d.keys())  # a list of all existing keys (slow!)

# as d was opened WITHOUT writeback=True, beware:
d['xx'] = list(range(4))  # this works as expected, but...
d['xx'].append(5)         # *this doesn't!* -- d['xx'] is STILL [0, 1, 2, 3]!

# having opened d without writeback=True, you need to code carefully:
temp = d['xx']      # extracts the copy
temp.append(5)      # mutates the copy
d['xx'] = temp      # stores the copy right back, to persist it

# or, d=shelve.open(filename,writeback=True) would let you just code
# d['xx'].append(5) and have it work as expected, BUT it would also
# consume more memory and make the d.close() operation slower.

d.close()       # close it
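
Since the question is about memoizing a function, here is one way the shelve calls above could be wrapped into a decorator. This is only a sketch: `disk_memoize`, `slow_square`, and the shelf path are hypothetical names, it handles positional arguments only, and it reopens the shelf on every call, which keeps the code simple but is slow for hot loops.

```python
import functools
import os
import shelve
import tempfile


def disk_memoize(path):
    """Cache a function's results in a shelf stored at `path`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args):
            key = repr(args)  # shelf keys must be strings
            with shelve.open(path) as db:
                if key not in db:
                    db[key] = func(*args)  # computed once, then persisted
                return db[key]
        return wrapper
    return decorator


calls = []  # records real invocations, to show when the cache is hit
shelf_path = os.path.join(tempfile.mkdtemp(), "square_cache")


@disk_memoize(shelf_path)
def slow_square(n):
    calls.append(n)
    return n * n
```

Calling `slow_square(12)` twice runs the function body only once; the second call is answered from the shelf, and the cached values survive across program runs as long as the shelf file is kept.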