Removing files with duplicate content from a single directory [Perl or algorithm]

Question

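I have a folder with a large number of files, some of which have exactly the same content. I want to remove the files with duplicate content: if two or more files with identical content are found, I'd like to keep one of them and delete the others.

Below is what I came up with, but I don't know whether it even works :), I haven't tried it yet.

How would you do it? Perl or a general algorithm.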

use strict;
use warnings;

my @files = <"./files/*.txt">;

my $current = 0;

while( $current <= $#files ) {

    # read contents of $files[$current] into $contents1 scalar

    my $compareTo = $current + 1;
    while( $compareTo <= $#files ) {

        # read contents of $files[compareTo] into $contents2 scalar

        if( $contents1 eq $contents2 ) {
            splice(@files, $compareTo, 1);
            # delete $files[compareTo] here
        }
        else {
            $compareTo++;
        }
    }

    $current++;
}

Answer 1

Eth*_*her 8

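Here's a general algorithm (edited for efficiency now that I've shaken off the sleepies, and I also fixed a bug that no one reported)... :)

It's going to take forever (not to mention a lot of memory) if I compare every single file's contents against every other file's. Instead, why don't we apply the same search to their sizes first, and then compare checksums only for files of identical size?

So when we ~~md5sum every file (see Digest::MD5)~~ calculate their sizes, we can use a hash table to do the matching for us, storing the matches together in arrayrefs: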

use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

my %files_by_size;
foreach my $file (@ARGV)
{
    push @{$files_by_size{-s $file}}, $file;   # store filename in the bucket for this file size (in bytes)
}

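Now we just have to pull out the potential duplicates and check whether they really are the same, using the same hashing technique (by creating a checksum for each with Digest::MD5):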

while (my ($size, $files) = each %files_by_size)
{
    next if @$files == 1;

    my %files_by_md5;
    foreach my $file (@$files)                    # iterate over the files in this size bucket
    {
        open my $filehandle, '<', $file or die "Can't open $file: $!";
        binmode $filehandle;                      # read raw bytes so binary files hash correctly
        # enable slurp mode
        local $/;
        my $data = <$filehandle>;
        close $filehandle;

        my $md5 = md5_hex($data);
        push @{$files_by_md5{$md5}}, $file;       # store filename in the bucket for this MD5
    }

    while (my ($md5, $files) = each %files_by_md5)
    {
        next if @$files == 1;
        print "These files are equal: " . join(", ", @$files) . "\n";
    }
}
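
The loop above only reports each group of identical files. As a minimal sketch of the deletion step the question asks for, the same size-then-MD5 bucketing could be wrapped as below. The helper names `duplicate_groups` and `delete_duplicates` are hypothetical, and keeping the alphabetically first file of each group is an arbitrary assumption, not something the question specifies.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper: bucket files by size, then by MD5 digest, and
# return a list of arrayrefs, each holding files with identical content.
sub duplicate_groups {
    my @files = grep { -f $_ } @_;    # ignore directories and missing paths

    my %by_size;
    for my $file (@files) {
        my $size = -s $file;
        push @{ $by_size{ $size || 0 } }, $file;
    }

    my @groups;
    for my $same_size ( grep { @$_ > 1 } values %by_size ) {
        my %by_md5;
        for my $file (@$same_size) {
            open my $fh, '<', $file or die "Can't open $file: $!";
            binmode $fh;
            # addfile reads from the handle, so the file is not slurped into a scalar
            my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
            close $fh;
            push @{ $by_md5{$digest} }, $file;
        }
        push @groups, grep { @$_ > 1 } values %by_md5;
    }
    return @groups;
}

# Hypothetical helper: keep the alphabetically first file of each group,
# delete the rest, and return the names of the files actually deleted.
sub delete_duplicates {
    my @deleted;
    for my $group ( duplicate_groups(@_) ) {
        my ( $keep, @extra ) = sort @$group;
        for my $file (@extra) {
            if ( unlink $file ) {
                push @deleted, $file;
            }
            else {
                warn "Could not delete $file: $!";
            }
        }
    }
    return @deleted;
}

print "Deleted $_\n" for delete_duplicates(@ARGV);
```

Saved under a name of your choosing, it would be run as `perl script.pl ./files/*.txt`. Since `unlink` is irreversible, it is worth swapping it for a `print` on the first run.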

-fini

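I would bucket the files by size and only check MD5 sums for files of the same size. (3 upvotes)
Whether to do that depends on the size of the files. Usually when I'm looking for duplicates, it's among large image files, where the MD5 part would be very slow. For text files such as program source it's unlikely to be much of a problem, so the simple code is fine. (2 upvotes)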
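
For the large-file case raised in these comments, one alternative to hashing (a swapped-in technique, not what the answer uses) is to compare two same-sized candidates directly with the core File::Compare module, which reads both files in blocks and returns as soon as they differ. A minimal sketch, with `same_content` as a hypothetical helper name:

```perl
use strict;
use warnings;
use File::Compare qw(compare);

# Hypothetical helper: return true if the two files have identical content.
# The cheap size check runs first; compare() then reads both files block
# by block and stops at the first difference.
sub same_content {
    my ( $file_a, $file_b ) = @_;

    my $size_a = -s $file_a;
    my $size_b = -s $file_b;
    return 0 unless ( $size_a || 0 ) == ( $size_b || 0 );

    my $result = compare( $file_a, $file_b );    # 0 = equal, 1 = different, -1 = error
    die "Could not compare $file_a and $file_b: $!" if $result < 0;
    return $result == 0;
}
```

This only pays off when a size bucket holds a few files. With many same-sized files, pairwise comparison rereads each file repeatedly, and a digest per file is likely the better trade.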

Answer 2

Rud*_*dog 6

md5sum *.txt | perl -ne '
   chomp; 
   ($sum, $file) = split(" "); 
   push @{$files{$sum}}, $file; 
   END {
      foreach (keys %files) { 
         shift @{$files{$_}}; 
         unlink @{$files{$_}} if @{$files{$_}};
      }
   }
'

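If you need comments for that script, close your terminal window now and never type another program. (7 upvotes)
I think this may fail if the original file names contain spaces. To fix that, use `split " ", $_, 2` instead, where the 2 stops it from splitting more than once (so you get exactly two parts). (3 upvotes)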
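
The effect of that limit argument can be sketched in isolation. `parse_md5sum_line` is a hypothetical helper, and the sketch assumes plain `md5sum` text-mode output (checksum, whitespace, file name) with no newlines or escaped characters in the name:

```perl
use strict;
use warnings;

# Hypothetical helper: split one line of md5sum output into (checksum, filename).
# The limit of 2 makes split stop after the first run of whitespace, so any
# spaces inside the file name are left intact.
sub parse_md5sum_line {
    my ($line) = @_;
    chomp $line;
    my ( $sum, $file ) = split " ", $line, 2;
    return ( $sum, $file );
}

my ( $sum, $file ) = parse_md5sum_line("d41d8cd98f00b204e9800998ecf8427e  my notes.txt\n");
print "$sum => $file\n";    # prints: d41d8cd98f00b204e9800998ecf8427e => my notes.txt
```

Without the limit, the same line would split into three fields and `$file` would hold only `my`.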

Answer 3

gho*_*g74 6

Perl, with the Digest::MD5 module.

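use strict;
use warnings;
use Digest::MD5;

my %seen;    # MD5 digest => first file name seen with that digest
while ( defined( my $filename = <*> ) ) {
    next if -d $filename;    # skip directories
    print "doing .. $filename\n";
    my $md5 = getmd5($filename);
    if ( !defined $seen{$md5} ) {
        $seen{$md5} = $filename;
    }
    else {
        print "Duplicate: $filename and $seen{$md5}\n";
    }
}

# Return the hex MD5 digest of the named file's contents.
sub getmd5 {
    my $file = shift;    # use the argument rather than relying on the global $_
    open( my $fh, "<", $file ) or die "Cannot open file $file: $!\n";
    binmode($fh);
    my $md5 = Digest::MD5->new;
    $md5->addfile($fh);
    close($fh);
    return $md5->hexdigest;
}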

If Perl isn't a must and you're working on *nix, you can use shell tools:

find /path -type f -print0 | xargs -0 md5sum | \
    awk '($1 in seen) { print "duplicate: " $2 " and " seen[$1] }
         !($1 in seen) { seen[$1] = $2 }'
