Posts by dan*_*man

python - Improving the efficiency of searching large files with readlines(size)

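I am new to Python and currently using Python 2. I have several source files, each containing a large amount of data (about 19 million lines). The data looks like this: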

apple   \t N   \t apple
n&apos
garden  \t N   \t garden
b\ta\md 
great   \t Adj \t great
nice    \t Adj \t (unknown)
etc


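My task is to search column 3 of each file for a set of target words. Each time a target word is found in the corpus, the 10 words before and after it must be added to a multidimensional dictionary.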

Edit: lines containing '&', '\', or the string '(unknown)' should be excluded.
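The exclusion rule can be sketched as a small helper. This is only an illustration: the name `lemma_or_none` is hypothetical, and it assumes the columns are separated by real tab characters with optional padding spaces, which the sample above suggests but does not confirm.

```python
def lemma_or_none(line):
    """Return the third tab-separated column, or None if the line is excluded.

    Assumption: columns are separated by real tab characters and may be
    padded with spaces, as the sample data suggests.
    """
    # Exclusion rule from the question: '&', a backslash, or '(unknown)'.
    if '&' in line or '\\' in line or '(unknown)' in line:
        return None
    cols = [col.strip() for col in line.split('\t')]
    # Lines without three columns cannot provide a column-3 word.
    if len(cols) < 3:
        return None
    return cols[2]
```

Returning None instead of raising keeps the caller's loop simple: it can skip any line for which the helper yields nothing.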

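I tried to solve this with readlines() and enumerate(), as shown in the code at the end of this post. The code does what it should, but it is clearly not efficient enough for the amount of data in the source files.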

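I know that readlines() or read() should not be used on large datasets, because they load the whole file into memory. However, when reading the file line by line, I could not get the enumerate approach to retrieve the 10 words before and after the target word. I also cannot use mmap, because I do not have permission to use it on these files.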
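One possible way to get both sides of the window while still reading line by line is to remember the last 10 words in a bounded `collections.deque` and to keep a list of hits that are still waiting for their following words. The sketch below is not the original code: `collect_contexts`, its `window` parameter, and the 'before'/'after' layout of the result are hypothetical choices made for illustration.

```python
from collections import deque


def collect_contexts(words, targets, window=10):
    """Map each target to a list of {'before': [...], 'after': [...]} hits.

    `words` can be any iterable, including a generator, so the corpus
    never has to be held in memory as a whole.
    """
    contexts = {target: [] for target in targets}
    before = deque(maxlen=window)   # the last `window` words seen so far
    pending = []                    # hits still collecting "after" words
    for word in words:
        # Feed the current word to every hit that is not yet complete.
        for hit in pending:
            hit['after'].append(word)
        pending = [hit for hit in pending if len(hit['after']) < window]
        if word in contexts:
            hit = {'before': list(before), 'after': []}
            contexts[word].append(hit)
            pending.append(hit)
        before.append(word)
    return contexts
```

Feeding it a generator that yields the column-3 word of each line read via `for line in f` should keep memory use close to one window per open hit. A target that occurs near the end of the input simply ends up with fewer than `window` following words.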

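So I thought readlines() with a size limit would be the most efficient solution. But wouldn't that introduce errors? Each time the end of a size-limited chunk is reached, the 10 words after a target word would not be captured, because the code simply breaks off there.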

def get_target_to_dict(file):
    # build a dictionary with one empty sub-dictionary per target word
    targets_dict = {}
    with open(file) as f:
        for line in f:
            targets_dict[line.strip()] = {}
    return targets_dict

targets_dict = get_target_to_dict('targets_uniq.txt')
# browse directory and process each file
# find the target words to include the 10 words before and after to the dictionary
# exclude lines starting with …

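On the chunk-boundary worry: readlines() with a size hint returns whole lines only, so no line is cut in half. The remaining risk is losing window state between chunks, and that goes away if the chunks are flattened back into one continuous stream of lines. The sketch below assumes this is acceptable for the task. `iter_lines_in_chunks` and its default `sizehint` are hypothetical names and values.

```python
def iter_lines_in_chunks(f, sizehint=1 << 20):
    """Yield lines from file object `f`, fetched in readlines(sizehint) batches.

    Each batch holds complete lines totalling roughly `sizehint` bytes, and
    the caller sees one unbroken sequence of lines across all batches.
    """
    while True:
        chunk = f.readlines(sizehint)
        if not chunk:       # an empty list signals end of file
            break
        for line in chunk:
            yield line
```

Because the consumer sees a single sequence, any state it keeps across iterations (such as the preceding 10 words) carries over from one chunk to the next. Plain `for line in f` is already buffered internally, so this may not be faster in practice; it is worth timing both on a real file.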

python dictionary enumerate multidimensional-array readlines

dan*_*man

2016-11-14

7 votes

1 answer

256 views