Posted by Lar_der

How can I find duplicate lines across many large files?

I have ~30k files. Each file contains ~100k lines. No line contains spaces. Within each file, the lines are sorted and free of duplicates.

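My goal: I want to find every line that appears in two or more files, along with the names of the files that contain each duplicated entry.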

A simple solution would be this:

cat *.words | sort | uniq -c | awk '$1 > 1'

Then I would run:

grep 'duplicated entry' *.words

Do you see a more efficient way?
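
One direction that might be worth trying, since every file is already sorted: merge the files instead of re-sorting them, then look up all duplicated lines in a single grep pass rather than one grep per entry. The following is only a minimal sketch. The function names find_dups and report_files are made up for illustration, and it assumes the files were sorted in the C locale (byte order), that no file name starts with a dash, and that the grep in use supports -H (GNU and BSD grep do).

```shell
# Hypothetical helper: print each line that occurs in two or more input files.
# Assumption: every input file is already sorted in the C locale (byte order).
find_dups() {
    # sort -m merges pre-sorted inputs without sorting them again;
    # uniq -d prints one copy of each line that is repeated in the merged stream.
    LC_ALL=C sort -m -- "$@" | LC_ALL=C uniq -d
}

# Hypothetical helper: print "file:line" for every occurrence of a duplicated line.
# $1 is a file holding the duplicated lines; the remaining arguments are the inputs.
report_files() {
    dup_list=$1
    shift
    # -F: fixed strings, -x: match whole lines only,
    # -f: read the patterns from a file, -H: always print the file name.
    grep -H -F -x -f "$dup_list" -- "$@"
}
```

Usage could look like find_dups *.words > dups.txt followed by report_files dups.txt *.words. Because no file contains the same line twice, any line repeated in the merged stream must come from at least two different files. The second step keeps the whole list of duplicated lines in memory, so it may need to be split into chunks if that list is very large.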

performance large-files shell-script text-processing deduplication
