Posts by fda*_*bhi

How to search for strings in text files faster

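I want to search thousands of text files stored in a folder (there may be up to 100k text files, each ranging in size from 1 KB to 100 MB) for a list of strings (the list contains 2k to 10k strings) and output a CSV file of the matching text file names.

I have written code that does the job, but searching for 2,000 strings in about 2,000 text files totaling roughly 2.5 GB takes about 8-9 hours.

In addition, this approach uses up the system's memory, so I sometimes have to split the 2,000 text files into smaller batches to run the code.

The code is as follows (Python 2.7).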

# -*- coding: utf-8 -*-
import pandas as pd
import os

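def match(searchterm):
    # Check searchterm against every loaded file and append one row
    # [searchterm, matching file names, match rates] to the global result.
    global result
    filenameText = ''
    matchrateText = ''
    for i, content in enumerate(TextContent):
        matchrate = search(searchterm, content)
        if matchrate:
            filenameText += str(listoftxtfiles[i])+";"
            matchrateText += str(matchrate) + ";"
    result.append([searchterm, filenameText, matchrateText])


def search(searchterm, content):
    # Case-insensitive substring test: 100 for a hit, 0 otherwise.
    # content.lower() builds a lowercased copy of the whole file on every call.
    if searchterm.lower() in content.lower():
        return 100
    else:
        return 0


# Read every file in Txt/ fully into memory before any searching starts.
listoftxtfiles = os.listdir("Txt/")
TextContent = []
for txt in listoftxtfiles:
    with open("Txt/"+txt, 'r') as txtfile:
        TextContent.append(txtfile.read())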

result …
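
For comparison, here is a minimal sketch of one way the two problems described above (run time and memory) might be reduced. It is not the code from the question and not a tested solution for the full data set. It swaps the loop order: each file is read and lowercased once, checked against every search term, and then released, so only one file is in memory at a time. The function names search_folder and write_csv, the UTF-8 decoding with errors ignored, and the two CSV column names are assumptions made for illustration. The sketch is written for Python 3; on Python 2.7 the csv output file would need to be opened in binary mode instead.

```python
# -*- coding: utf-8 -*-
import csv
import io
import os


def search_folder(folder, searchterms):
    """Map each search term to the list of file names that contain it.

    Hypothetical helper: files are processed one at a time, so memory use
    is bounded by the largest single file rather than by the whole folder.
    """
    matches = dict((term, []) for term in searchterms)
    # Lowercase each term once, up front, instead of once per file.
    lowered = [(term.lower(), term) for term in matches]

    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        # Assumption: the files are UTF-8 text; undecodable bytes are dropped.
        with io.open(path, 'r', encoding='utf-8', errors='ignore') as txtfile:
            content = txtfile.read().lower()  # lowercase once per file
        for low, term in lowered:
            if low in content:
                matches[term].append(name)
    return matches


def write_csv(matches, csv_path):
    """Write one row per search term: the term and its ';'-joined file names."""
    with open(csv_path, 'w', newline='') as out:  # Python 3 csv convention
        writer = csv.writer(out)
        writer.writerow(['searchterm', 'filenames'])
        for term in sorted(matches):
            writer.writerow([term, ';'.join(matches[term])])
```

With a list of terms in a variable such as terms, write_csv(search_folder("Txt/", terms), "result.csv") would produce the output file. Each file is still scanned once per search term, so for lists of several thousand terms a multi-pattern algorithm such as Aho-Corasick, which scans each file once for all terms together, may be worth evaluating.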


python pandas

fda*_*bhi

2017 07-16

6
votes

1
answer

193
views