Posts by Blo*_*b X

What is the most efficient way to check for duplicate data across multiple files?

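Suppose you have a folder containing hundreds or thousands of .csv or .txt files. They may all hold different information, but you want to make sure that joe041.txt does not accidentally contain the same data as joe526.txt.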

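Rather than loading everything into a single file (which can be cumbersome if each file has thousands of lines), I have started using a Python script that reads every file in the directory and computes a checksum for it. The checksums can then be compared across the thousands of files.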

Is there a more efficient way to do this?

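Even filecmp does not seem efficient here, because the module only offers file-vs-file and directory-vs-directory comparisons, with no file-vs-directory command. Using it would mean iterating X² times (every file in the directory against every other file in the directory).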
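To make that cost concrete, here is a minimal sketch of the pairwise approach built on `filecmp.cmp`. The function name `pairwise_duplicates` and its `folder` parameter are hypothetical, chosen only for illustration. Passing `shallow=False` asks `filecmp` to compare file contents rather than only the `os.stat()` signatures. Comparing each unordered pair once takes n(n-1)/2 calls for n files, which still grows quadratically.

```python
import filecmp
import itertools
import os


def pairwise_duplicates(folder):
    """Return (pairs_with_identical_content, number_of_comparisons)."""
    # Hypothetical helper: collect regular files only, in a stable order.
    paths = sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name))
    )
    matches = []
    comparisons = 0
    # Every unordered pair is compared once: n * (n - 1) / 2 calls in total.
    for a, b in itertools.combinations(paths, 2):
        comparisons += 1
        # shallow=False forces a comparison of the file contents.
        if filecmp.cmp(a, b, shallow=False):
            matches.append((os.path.basename(a), os.path.basename(b)))
    return matches, comparisons
```

With 1,000 files this would already mean 499,500 calls to `filecmp.cmp`, which is the scaling problem described above.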

import csv
import hashlib
import os

folder = "D:/Testing/New folder"
output_name = "output.csv"
rows = []

for name in os.listdir(folder):
    path = os.path.join(folder, name)
    # Skip subdirectories and the report written by a previous run.
    if name == output_name or not os.path.isfile(path):
        continue
    with open(path, "rb") as f:
        # Hash the raw bytes; identical content gives an identical digest.
        rows.append((name, hashlib.md5(f.read()).hexdigest()))

print(rows)

# Write one "filename,checksum" row per file.
with open(os.path.join(folder, output_name), "w", newline="") as f:
    csv.writer(f).writerows(rows)
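The script above stops at writing the checksums out, so the comparison itself is still a manual step. As a minimal sketch, and not necessarily the most efficient approach, the digests could instead be grouped in a dictionary so that each file is read once and duplicates fall out of a single pass. The names `file_digest` and `find_duplicates` and the 64 KiB chunk size are assumptions made for illustration.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size blocks so a large file is never fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(folder):
    """Return lists of file names whose contents hash to the same digest."""
    groups = defaultdict(list)
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            groups[file_digest(path)].append(name)
    # Only digests shared by two or more files indicate duplicates.
    return [names for names in groups.values() if len(names) > 1]
```

Calling `find_duplicates("D:/Testing/New folder")` returns one list of names per group of identical files. Because two different files can in principle share an MD5 digest, a final `filecmp.cmp(a, b, shallow=False)` inside each group would confirm that the contents really match.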

python checksum

Blo*_*b X

2018-10-24
