How do I create a hash table for large data in Python?

kis*_*dbn 3 python hash

I'm working on a project where I read as many as 250,000 items (or more) into a list and turn each item into a key of a hash table.

sample_key = open("sample_file.txt").readlines()   # one list entry per line of the file
sample_counter = [0] * len(sample_key)             # a zero counter for each line
sample_hash = {sample.replace('\n', ''): counter
               for sample, counter in zip(sample_key, sample_counter)}

This code works fine as long as len(sample_key) is in the 1,000-2,000 range. Beyond that, it simply seems to skip processing any further data.

Any suggestions on how I can handle this large list of data?

PS: Also, if there is a better way to perform this task (such as reading directly into the hash table keys), please suggest it. I'm new to Python.
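For example, is something along these lines, which builds the dictionary in a single pass over the file instead of going through an intermediate list, a reasonable direction? (This is only a sketch of what I mean by "directly"; the zero values mirror the counters in the code above.)

# Build the dictionary while reading the file, with no intermediate list.
with open("sample_file.txt") as f:
    sample_hash = {line.strip(): 0 for line in f if line.strip()}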

Ale*_*der 6

Your text file likely has duplicates, which will overwrite existing keys in your dictionary (the Python name for a hash table). You can build a set of unique keys and then use a dictionary comprehension to populate the dictionary.

sample_file.txt

a
b
c
c

Python code

with open("sample_file.txt") as f:
    keys = set(line.strip() for line in f.readlines())
my_dict = {key: 1 for key in keys if key}
>>> my_dict
{'a': 1, 'b': 1, 'c': 1}
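If the duplicates actually matter, i.e. if each key should end up with the number of times its line occurs rather than a fixed value of 1, a collections.Counter built in one pass over the file is a common alternative. This is a sketch of that variation, assuming counting is the goal:

from collections import Counter

# Count how many times each stripped, non-empty line occurs in the file.
with open("sample_file.txt") as f:
    line_counts = Counter(line.strip() for line in f if line.strip())

>>> line_counts
Counter({'c': 2, 'a': 1, 'b': 1})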

Here is a run on one million random alphabetic strings of length 10. The timing is relatively insignificant, at under half a second.

import string
import numpy as np

letter_map = {n: letter for n, letter in enumerate(string.ascii_lowercase, 1)}
long_alpha_list = ["".join([letter_map[number] for number in row]) + "\n"
                   for row in np.random.randint(1, 27, (1000000, 10))]  # upper bound exclusive: letters 1-26
>>> long_alpha_list[:5]
['mfeeidurfc\n',
 'njbfzpunzi\n',
 'yrazcjnegf\n',
 'wpuxpaqhhs\n',
 'fpncybprrn\n']

>>> len(long_alpha_list)
1000000

# Write list to file.
with open('sample_file.txt', 'w') as f:
    f.writelines(long_alpha_list)

# Read them back into a dictionary per the method above.
with open("sample_file.txt") as f:
    keys = set(line.strip() for line in f.readlines())

>>> %%timeit -n 10
>>> my_dict = {key: 1 for key in keys if key}

10 loops, best of 3: 379 ms per loop
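Note that %%timeit is an IPython/Jupyter cell magic. In a plain script, roughly the same measurement can be made with the standard timeit module; the sketch below assumes the keys set built above is already in memory.

import timeit

# Plain-script equivalent of the %%timeit cell above: run the dictionary
# comprehension 10 times and report the average time per run.
elapsed = timeit.timeit(lambda: {key: 1 for key in keys if key}, number=10)
print("average per run: %.3f s" % (elapsed / 10))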