Finding files with identical content

Question

Answering my own question, using Kubator's command line:

# Show the files in the current directory that have identical content
showDuplicates (){
  last_hash=''
  # Outer loop: every line of the duplicates list (hash, then file name)
  while read -r f1_hash f1_name; do
    # Start a new group only when the hash changes
    if [ "$last_hash" != "$f1_hash" ]; then
      echo "The following files have the exact same content :"
      echo "$f1_name"
      # Inner loop: print every other file that shares this hash
      while read -r f2_hash f2_name; do
        if [ "$f1_hash" = "$f2_hash" ] && [ "$f1_name" != "$f2_name" ]; then
          echo "$f2_name"
        fi
      done < <(find ./ -maxdepth 1 -type f -print0 | xargs -0 md5sum | sort -k1,32 | uniq -w32 -D)
    fi
    last_hash="$f1_hash"
  done < <(find ./ -maxdepth 1 -type f -print0 | xargs -0 md5sum | sort -k1,32 | uniq -w32 -D)
}

Original question:

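I have seen some discussions about the problem I'm about to ask, but I'm having a hard time understanding the mechanics behind the proposed solutions, and I haven't been able to solve the problem described below.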

I want to write a function that compares files, and for that I naively tried the following:

#somewhere I use that to get the files paths
files_to_compare=$(find $base_path -maxdepth 1 -type f)
files_to_compare=( $files_to_compare )

#then I pass files_to_compare as an argument to the following function
showDuplicates (){
  files_to_compare=${1}
  n_files=$(( ${#files_to_compare[@]} ))
  for (( i=0; i < $n_files ; i=i+1 )); do
     for (( j=i+1; j < $n_files ; j=j+1 )); do
         sameContent "${files_to_compare[i]}" "${files_to_compare[j]}"
         r=$?
         if [ $r -eq 1 ]; then
            echo "The following files have the same content :"
            echo ${files_to_compare[i]}
            echo ${files_to_compare[j]}
         fi
    done
  done
}

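The function "sameContent" takes the absolute paths of two files and uses several commands (du, wc, diff) to return 1 or 0, depending on whether the files have the same content or not (respectively).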

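The flaw in this code shows up with file names that contain spaces, and I have since read that this is not the way to manipulate files in bash anyway.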

On https://unix.stackexchange.com/questions/392393/bash-moving-files-with-spaces and a few other pages, I read that the correct way is to use code like this:

$ while IFS= read -r file; do echo "$file"; done < files

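I can't seem to understand what lies behind that bit of code or how I could use it to solve my problem, particularly because I want/need to use nested loops.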

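I'm new to bash and this seems to be a common problem, but if someone were kind enough to give me some insight into how it works, that would be wonderful.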

P.S.: please excuse any grammar mistakes.

Answer 1

Kub*_*tor 7

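How about using md5sum to compare the content of the files in your folder instead? That is a safer and more standard approach. Then you only need something like this: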

find ./ -type f -print0 | xargs -0 md5sum | sort -k1,32 | uniq -w32 -D

What it does:

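find looks for all files (-type f) under the current folder (./) and prints them separated by a null byte (-print0), which is needed for special characters such as spaces in file names (like the moving files with spaces you mentioned)
xargs takes the null-byte-separated output from find (-0) and runs md5sum on the files
sort orders the output by md5 hash (-k1,32). Strictly speaking, -k1,32 selects fields 1 through 32 rather than character positions 1 to 32; it still works here because the hash starts every line, and -k1,1 would state the intent more precisely
uniq compares only the first 32 characters, i.e. the md5 hash (-w32), and prints only the duplicated lines (-D)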
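
The null-byte separation described for find can also be consumed directly by bash, which is one way to replace the word-splitting array from the question. The following is only a minimal sketch: collect_files and the global array files are illustrative names that do not come from the thread, and it assumes bash (for process substitution and read -d '').

```shell
# Sketch: read NUL-delimited paths from find into a bash array.
# "collect_files" and the global array "files" are illustrative names.
collect_files() {
  local dir=${1:-.} path
  files=()
  # IFS= turns off trimming of blanks, -r keeps backslashes literal,
  # -d '' makes read stop at each NUL byte instead of at a newline.
  while IFS= read -r -d '' path; do
    files+=("$path")
  done < <(find "$dir" -maxdepth 1 -type f -print0)
}
```

After calling collect_files "$base_path", the paths can be looped over with for f in "${files[@]}", and names containing spaces stay in one piece as long as every expansion is quoted.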

Example output of the md5sum pipeline:

7a2e203cec88aeffc6be497af9f4891f  ./file1.txt
7a2e203cec88aeffc6be497af9f4891f  ./folder1/copy_of_file1.txt
e97130900329ccfb32516c0e176a32d5  ./test.log
e97130900329ccfb32516c0e176a32d5  ./test_copy.log

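To act on each line of that kind of output inside a script, the while IFS= read -r idiom from the question can be applied to the pipeline itself. The sketch below prints one header per group of identical files in a single pass. The function name print_duplicate_groups is made up for illustration, and the sketch assumes bash, GNU xargs, md5sum and uniq, and file names without newlines or backslashes (md5sum escapes those).

```shell
# Sketch: print groups of identical files in one pass over the pipeline output.
# Assumes GNU md5sum/uniq/xargs; "print_duplicate_groups" is an illustrative name.
print_duplicate_groups() {
  local dir=${1:-.} line hash name last_hash=''
  # IFS= keeps leading and trailing blanks, -r keeps backslashes literal.
  while IFS= read -r line; do
    hash=${line%% *}   # everything before the first space: the MD5 hash
    name=${line:34}    # 32 hash characters plus 2 separator characters
    if [ "$hash" != "$last_hash" ]; then
      echo "The following files have the exact same content:"
      last_hash=$hash
    fi
    printf '%s\n' "$name"
  done < <(find "$dir" -maxdepth 1 -type f -print0 |
             xargs -0 -r md5sum | LC_ALL=C sort | uniq -w32 -D)
}
```

Because the lines arrive sorted by hash, a change of hash marks the start of a new group, so no inner loop or second run of the pipeline is needed.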

If performance is crucial, this can be tuned to sort by file size first and only then compare md5sums. Or call mv, rm, etc. on the results.
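
One possible reading of the size-first idea is sketched here: files are hashed only when at least one other file has the same size. This is an assumption about what such tuning could look like, not code from the answer. The function name find_duplicates_size_first is illustrative, and the sketch relies on bash 4+ (associative arrays) and GNU find (-printf), md5sum and uniq.

```shell
# Sketch: hash only the files whose size is shared with another file.
# Assumes bash 4+ and GNU find/md5sum/uniq; the function name is illustrative.
find_duplicates_size_first() {
  local dir=${1:-.} entry size i
  local -A count=()                       # size -> number of files of that size
  local -a sizes=() names=() candidates=()
  # Each NUL-terminated record is "<size in bytes> <path>".
  while IFS= read -r -d '' entry; do
    size=${entry%% *}
    sizes+=("$size")
    names+=("${entry#* }")
    count[$size]=$(( ${count[$size]:-0} + 1 ))
  done < <(find "$dir" -maxdepth 1 -type f -printf '%s %p\0')
  # Keep only files that have at least one same-sized sibling.
  for i in "${!names[@]}"; do
    if [ "${count[${sizes[i]}]}" -gt 1 ]; then
      candidates+=("${names[i]}")
    fi
  done
  [ "${#candidates[@]}" -gt 0 ] || return 0
  md5sum -- "${candidates[@]}" | LC_ALL=C sort | uniq -w32 -D
}
```

Files with a unique size cannot have a duplicate, so skipping them avoids reading their content at all; whether that pays off depends on how many large files share a size.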
