Most efficient way to search for a word in multiple files

Question

Pie*_*ale 2 python string performance search file

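For my master's thesis I downloaded a large number of finance-related files. My goal is to find a specific phrase ("chapter 11") in order to flag every company that has gone through a debt restructuring process. The problem is that I have more than 1.2 million small files, which makes the search inefficient. So far I have written very basic code, and it processes about 1,000 documents every 40-50 seconds. I would like to know whether there is a specific library or method (or even a programming language) that can search faster. This is the function I am using so far: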

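def get_items(m):
    """Return True if the file at path m contains the phrase, ignoring case."""
    word = "chapter 11"
    # the context manager closes the file even if read() raises
    with open(m, encoding='utf8') as f:
        document = f.read()
    return word in document.lower()

# apply the function to the list of names:
l_v1 = list(map(get_items, filenames))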

File sizes vary between 5 and 4000 KB.

Answer 1

use*_*654 5

Try the Unix tool grep.

If there are only a few files, you can do this:

grep -i "chapter 11" file1 file2 ...

Or:

grep -i "chapter 11" file*.txt

If there are many files, you can combine grep with find:

find . -type f -print0 | xargs -0 grep -i "chapter 11"

Another powerful tool is ack (written in Perl); see https://beyondgrep.com/.
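
If the results need to stay in Python, as in the question, a rough sketch of the same case-insensitive search is shown here. It is not benchmarked and is not part of the original answer. It assumes that reading files as raw bytes is acceptable, which avoids decoding each file and building a lowercased copy, and that some concurrency helps when the time is dominated by opening 1.2 million files. The names `PATTERN`, `contains_phrase`, `flag_files` and the `workers` default are hypothetical.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Compiled once. A bytes pattern lets us search files without decoding them.
# With re.IGNORECASE a bytes pattern folds ASCII letters only, which is
# sufficient for the phrase "chapter 11".
PATTERN = re.compile(rb"chapter 11", re.IGNORECASE)


def contains_phrase(path):
    """Return True if the file at `path` contains the phrase, ignoring case."""
    with open(path, "rb") as f:
        return PATTERN.search(f.read()) is not None


def flag_files(paths, workers=8):
    """Hypothetical helper: check many files concurrently.

    Results come back in the same order as `paths`. The worker count is an
    arbitrary assumption and should be tuned on the real data.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(contains_phrase, paths))
```

With the question's list of names this would be called as `l_v1 = flag_files(filenames)`. Threads mainly overlap the waiting on disk reads; if the regex search itself turns out to be the bottleneck, `concurrent.futures.ProcessPoolExecutor` offers the same `map` interface and may be worth trying instead.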
