Why is GNU diff such a memory hog?

Question

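There are plenty of questions asking how to diff large files, because diff cannot handle them.

I would like to know why GNU diff cannot handle them.

I ran a small experiment. I compared two identical data sets, like this: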

$ time /usr/bin/diff -u <(cat file1) <(cat file2) > /tmp/memoryhog
^C

real    5m6.478s
user    0m0.540s
sys     0m19.184s

This is what top showed at the moment I cancelled the job:

3  PID  %MEM    VIRT   SWAP    RES   CODE    DATA    SHR nMaj nDRT S  PR  NI  %CPU COMMAND
 19087  30.0   16.0g          9.4g   0.2m   16.0g   2.0m           R  20       0.5 /usr/bin/diff -u /dev/fd/63 /dev/fd/62

As expected, the output is empty:

$ stat -c '%s' /tmp/memoryhog 
0

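(They are not actually files but database results, and I forgot to track how many bytes diff had actually consumed at that point; my estimate is 30-60 GiB per piped input.)

But what is going on there?

Is diff allocating huge amounts of memory even when it does not need to track a single changed byte?

I can only assume part of the reason is that it has to keep track of line numbers, but allocating 16 GiB of virtual memory seems a bit much for that task!

What does diff think it needs that much memory for? Or is it just poor memory handling?

I already tried to keep diff as "stateless" or context-free as possible by using only -u, but I could not find any option to stop it tracking line numbers or to otherwise improve things further.

~~(The option --speed-large-files is actually a bogus parameter; it is not implemented in the code, so please do not suggest that one.)~~

Edit: To correct the wrong result of my own code check, I found this buried in the bug-diffutils ML:

When you pass --speed-large-files, src/diff.c sets speed_large_files to true. Then, src/analyze.c sets ctxt.heuristic to true. Then, the gnulib/lib/diffseq.h function diag() applies the heuristic.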

Answer 1

nyo*_*yov 3

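I believe I have found the reason for this behavior.
It seems that diff always reads the entire file into memory. Honestly, I am surprised by this. I did not think that was the case, or that it would be necessary for a line-based tool, but apparently it is.

This information is based on the bug report here: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=21665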

Unfortunately I have found that diff reads the entire input files into memory, leading to "/usr/bin/diff: memory exhausted" messages [...]
It came with a reply that reads:

> Would you be open to patches that enable diffing large files by using
> mmap?

I doubt whether that would help that much, as it still needs to construct information about each line, and that information consumes memory too. Doing this in secondary storage would be a bear. In practice when I've run into this problem, I've either gotten a bigger machine or made my input lines shorter. Preferably the former.
And finally:

As Paul responded [...], using mmap seems unlikely to help much, but if you write the patch and demonstrate that it does make a difference, we'll be very interested, and I will happily reopen the issue. For now, I'm marking this as notabug and closing it.
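Given this, GNU diff looks set to remain limited in its handling of large files, unless someone finds a way around the obstacles pointed out in the bug report, or implements a diff tool that works differently.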
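To put the maintainers' point about per-line information into numbers, here is a back-of-the-envelope sketch in Python. It is only an illustration: the function name, the 64-byte average line length and the 16 bytes of bookkeeping per line are my own assumptions, not figures taken from the diffutils source.

```python
GIB = 1024 ** 3


def estimate_diff_memory(total_bytes, avg_line_bytes, per_line_overhead=16):
    """Rough memory need of a tool that loads one input completely and
    keeps a fixed-size record for every line.

    per_line_overhead is a hypothetical figure, not taken from diffutils.
    """
    line_count = total_bytes // avg_line_bytes
    return total_bytes + line_count * per_line_overhead


# One 30 GiB input (the lower end of the estimate in the question),
# with an assumed average line length of 64 bytes.
per_file = estimate_diff_memory(30 * GIB, 64)
print(f"per file: {per_file / GIB:.1f} GiB, both: {2 * per_file / GIB:.1f} GiB")
```

Even if the per-line records were free, the inputs alone would not fit into RAM on the machine from the question, so mmap or not, something has to give.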

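If someone comes up with a better or more in-depth answer (perhaps from a code review), I will gladly accept it.

PS: So far I have had only moderate success with a line-by-line reader in Python based on difflib, which is meant to find differences but not to create a patchable diff file. It can read through several GiB, but at some point it seems to go "out of sync" and from then on reports differences for lines that are actually identical. And of course it is slow. If I manage to build a workable solution at some point, I will post the source.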
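For illustration, the simplest form of such a line-by-line reader might look like the sketch below. This is not my actual script, and it does not use difflib; it only walks both inputs in lockstep and reports the first mismatch. The name first_difference is made up for this example.

```python
import io
from itertools import zip_longest


def first_difference(stream_a, stream_b):
    """Return (line_number, line_a, line_b) for the first mismatch, or None.

    Both streams are read in lockstep, so memory use is bounded by the
    longest single line rather than by the size of the inputs. If one
    stream ends early, its missing line is reported as None.
    """
    pairs = zip_longest(stream_a, stream_b)
    for line_number, (line_a, line_b) in enumerate(pairs, start=1):
        if line_a != line_b:
            return line_number, line_a, line_b
    return None


if __name__ == "__main__":
    # Small in-memory stand-ins for the two large inputs.
    left = io.StringIO("alpha\nbeta\ngamma\n")
    right = io.StringIO("alpha\nBETA\ngamma\n")
    print(first_difference(left, right))
```

A reader like this answers "are the inputs identical, and if not, where do they first diverge?" in constant memory, but it cannot resynchronize after an inserted or deleted line: every later line would be paired with the wrong partner. Recovering from that requires looking ahead in both inputs, which is presumably why a real diff keeps information about every line.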
