How to compare (diff) two large CSV files where order doesn't matter

Question

luc*_*i5r 0 python diff compare large-files

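I'm trying to compare (diff) two large CSV files.

The order of the rows doesn't matter.
I don't need to print the differences or anything else, just True or False.

For example:

File 1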

a,b,c,d
e,f,g,h
i,j,k,l


File 2

a,b,c,d
i,j,k,l
e,f,g,h


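The above should pass the comparison: even though the rows are in a different order, the contents are identical.

The comparison should fail if the contents differ, if column values don't match, if a row exists in one file but not in the other, and so on.

The biggest problem I have is that the files are very large and there is no key column to sort on. The files have 14 to 30 million rows and about 10 to 15 columns. An unsorted raw data dump is a CSV file of around 1 GB.

Right now I'm trying to sort the files and "diff" them using the code below. The problem is that the "sort" doesn't always work. With smaller files and fewer rows the sort and diff work, but it doesn't seem to work on very large files.

Also, sorting adds significant time to the operation; ideally I'd like to avoid sorting and just compare while ignoring row order, but I don't know how to do that.

filecmp, difflib and some other functions I tried all require pre-sorted files.

I'm doing a Python merge sort right now, but as I said, the sort doesn't necessarily work on a very large number of rows, and I'm hoping there is a better way to compare.

This is the Python merge sort function: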

import os
from itertools import cycle, islice
from tempfile import gettempdir

def batch_sort(self, input, output, key=None, buffer_size=32000, tempdirs=None):
    # tempdirs may be given as a comma-separated string
    if isinstance(tempdirs, str):
        tempdirs = tempdirs.split(",")

    # fall back to the system temp directory
    if not tempdirs:
        tempdirs = [gettempdir()]

    chunks = []
    try:
        with open(input, 'rb', 64 * 1024) as input_file:
            input_iterator = iter(input_file)
            # spread the sorted chunks across the temp directories in turn
            for tempdir in cycle(tempdirs):
                # read at most buffer_size lines and sort them in memory
                current_chunk = list(islice(input_iterator, buffer_size))
                if not current_chunk:
                    break
                current_chunk.sort(key=key)
                output_chunk = open(os.path.join(tempdir, '%06i' % len(chunks)), 'w+b', 64 * 1024)
                chunks.append(output_chunk)
                output_chunk.writelines(current_chunk)
                output_chunk.flush()
                output_chunk.seek(0)
        with open(output, 'wb', 64 * 1024) as output_file:
            # self.merge (not shown here) merges the sorted chunk files
            output_file.writelines(self.merge(key, *chunks))
    finally:
        # close and delete the temporary chunk files
        for chunk in chunks:
            try:
                chunk.close()
                os.remove(chunk.name)
            except Exception:
                pass


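I can call batch_sort(), giving it an input file and an output file, the chunk size, and the temp directories to use.

Once I have run batch_sort() on both files, I can just "diff file1 file2".

The above works for 25,000 to 75,000 rows, but not for more than 14 million rows.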
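
For the final step only: since the goal is a True/False result and not a printed diff, two already-sorted files can be checked for byte-for-byte equality from Python. The following is a minimal sketch, assuming both sorted outputs exist on disk; the function name sorted_files_match is made up for illustration and is not part of the code above.

```python
import filecmp


def sorted_files_match(sorted_a, sorted_b):
    """Return True if the two (already sorted) files have identical contents."""
    # shallow=False forces a comparison of the file contents
    # instead of relying only on the os.stat() signatures.
    return filecmp.cmp(sorted_a, sorted_b, shallow=False)
```

This only gives a meaningful answer if both files were sorted with the same key and use the same line endings, including on the last line.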

Answer 1

C.N*_*ivs 6

Just use a set and add each line to it. At the end, compare the sets:

def compare(file1, file2):
    with open(file1) as fh1, open(file2) as fh2:
        left = {line for line in fh1}
        right = {line for line in fh2}

    return left == right

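
One caveat with a set-based comparison: a set drops duplicate lines, so two files that differ only in how many times a row is repeated would compare as equal. If duplicates matter, a minimal sketch using collections.Counter might look like the following. The name same_rows is made up for illustration, and stripping the line endings is an assumption so that a missing final newline does not cause a mismatch.

```python
from collections import Counter


def same_rows(path_a, path_b):
    """Return True if both files hold the same lines with the same multiplicities."""
    with open(path_a, newline="") as fa, open(path_b, newline="") as fb:
        # Count how often each line occurs; line endings are stripped first.
        count_a = Counter(line.rstrip("\r\n") for line in fa)
        count_b = Counter(line.rstrip("\r\n") for line in fb)
    return count_a == count_b
```

Like the set version, this keeps every distinct line in memory, so it only helps with correctness, not with size.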

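If you are really concerned about size, you can load only the first file into a set and short-circuit as soon as a line from the second file is not found in it: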

def compare(file1, file2):
    with open(file1) as fh:
        left = {line for line in fh}
    
    right = set()

    with open(file2) as fh:
        for line in fh:
            if line not in left:
                return False
        
            right.add(line)

    return left == right


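Edit

Since you don't care about showing the differences and only need to check whether the files are the same, you can compute a numeric hash of each line and compare the sum for each file: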

def are_equivalent(a, b):
    with open(a) as fh, open(b) as gh:
        x = sum(hash(line) for line in fh)
        y = sum(hash(line) for line in gh)

    return x == y


This way you don't pay for storing the lines in any data structure.
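
A sum of hashes is a probabilistic check: different files can, in principle, produce the same sum, and the built-in hash() of a str is randomized per process, so its values cannot be compared across separate runs. A minimal sketch of the same idea with a wider, stable digest from hashlib is shown below. The names file_fingerprint and MODULUS are made up for illustration, and comparing the line count as well is an added assumption to catch a few more mismatches. It is still not a proof of equality.

```python
import hashlib

MODULUS = 1 << 256  # keep the running sum at a fixed width


def file_fingerprint(path):
    """Return (line_count, sum of per-line SHA-256 digests modulo 2**256)."""
    total = 0
    count = 0
    with open(path, "rb") as fh:
        for raw in fh:
            # Strip line endings so a missing final newline does not matter.
            line = raw.rstrip(b"\r\n")
            digest = hashlib.sha256(line).digest()
            # Addition is commutative, so the row order does not affect the sum.
            total = (total + int.from_bytes(digest, "big")) % MODULUS
            count += 1
    return count, total


def are_equivalent(a, b):
    return file_fingerprint(a) == file_fingerprint(b)
```

Memory use stays constant regardless of file size, at the cost of hashing every line of both files.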

Archived:	2 years ago
Views:	659
Last activity:	2 years ago