Fast checksum hashing?

mpe*_*pen 3 c# optimization

I've written a simple program that finds exact duplicate files on my computer, but it's a bit slow. Is there any way to speed it up?

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Security.Cryptography;

namespace DupeFinder
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Getting files...");
            var files = Directory.GetFiles(@"D:\Photos", "*", SearchOption.AllDirectories);
            var alg = new HMACMD5();
            var dict = new Dictionary<string,List<string>>();
            Console.WriteLine("Computing hashes...");
            int i=0;
            int cursorLeft = Console.CursorLeft;
            int cursorTop = Console.CursorTop;
            foreach(var fileName in files)
            {
                Console.SetCursorPosition(cursorLeft,cursorTop);
                Console.Write("Hashing file {0}/{1}", ++i, files.Length);
                using(var stream = new BufferedStream(File.OpenRead(fileName),1024*1024*5))
                {
                    var hash = alg.ComputeHash(stream);
                    var str = BitConverter.ToString(hash);
                    if (!dict.ContainsKey(str)) dict[str] = new List<string>();
                    dict[str].Add(fileName);
                }
            }
            Console.WriteLine();

            foreach(var dupe in dict.Where(p => p.Value.Count >= 2))
            {
                Console.WriteLine(string.Join(", ", dupe.Value));
            }

            Console.WriteLine("Done!");
            Console.ReadLine();
        }
    }
}

Possible optimizations:

  1. Avoid converting the byte array to a string in the first place. I tried this, but it didn't work, I'm guessing because it uses reference equality rather than comparing the bytes.
  2. A faster hashing algorithm?
  3. A different stream or buffer size? I tried 1024^3, which should be a megabyte, but that seemed to slow it down, if anything.

Or is this just an inherently slow thing to do?


I remembered that dictionaries can accept an IEqualityComparer, so I can write my own byte[] comparer.

Many of the algorithms I found on the internet tend to compare the byte lengths first, which I don't need to do, because I know it will always be 16 bytes. They also tend to compare 1 byte at a time... but I'm on a 64-bit machine, so why not do 8?

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Security.Cryptography;

namespace DupeFinder
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Getting files...");
            string dir = @"D:\Photos";
            var files = Directory.GetFiles(dir, "*", SearchOption.AllDirectories);
            var alg = new HMACMD5();
            var dict = new Dictionary<byte[], List<string>>(new Md5Comparer());
            Console.WriteLine("Computing hashes...");
            int i = 0;
            int cursorLeft = Console.CursorLeft;
            int cursorTop = Console.CursorTop;
            foreach (var fileName in files)
            {
                Console.SetCursorPosition(cursorLeft, cursorTop);
                Console.Write("Hashing file {0}/{1}", ++i, files.Length);
                using (var stream = new BufferedStream(File.OpenRead(fileName), 1024 * 1024 * 5))
                {
                    var hash = alg.ComputeHash(stream);
                    if (!dict.ContainsKey(hash)) dict[hash] = new List<string>();
                    dict[hash].Add(fileName);
                }
            }
            Console.WriteLine();

            using (var sw = new StreamWriter(Path.Combine(dir, "duplicates.txt")))
            {
                i = 0;
                foreach (var dupe in dict.Where(p => p.Value.Count >= 2))
                {
                    sw.WriteLine("Duplicate {0}", ++i);
                    foreach(var fn in dupe.Value)
                    {
                        sw.WriteLine("- {0}", fn);
                    }
                }
            }

            Console.WriteLine("Done!");
            //Console.ReadLine();
        }
    }

    class Md5Comparer : IEqualityComparer<byte[]>
    {
        public bool Equals(byte[] x, byte[] y)
        {
            var xi = BitConverter.ToInt64(x, 0);
            var yi = BitConverter.ToInt64(y, 0);
            if (xi != yi) return false;
            xi = BitConverter.ToInt64(x, 8);
            yi = BitConverter.ToInt64(y, 8);
            return xi == yi;
        }

        public int GetHashCode(byte[] obj)
        {
            return obj[0];
        }
    }
}

Not sure how much faster this is... I haven't done any benchmarking, but it doesn't seem any slower.
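One tweak worth considering, beyond what's in the thread: `GetHashCode` above returns only `obj[0]`, so the dictionary gets at most 256 distinct hash codes and files pile into long collision chains. Since MD5 output is uniformly distributed, the first four bytes already make a well-distributed .NET hash code. A sketch of the comparer with that change (my own suggestion, not from the original post):

```csharp
using System;
using System.Collections.Generic;

class Md5Comparer : IEqualityComparer<byte[]>
{
    public bool Equals(byte[] x, byte[] y)
    {
        // Two 64-bit compares cover all 16 bytes of an MD5 hash.
        return BitConverter.ToInt64(x, 0) == BitConverter.ToInt64(y, 0)
            && BitConverter.ToInt64(x, 8) == BitConverter.ToInt64(y, 8);
    }

    public int GetHashCode(byte[] obj)
    {
        // The first four bytes give up to 2^32 buckets,
        // versus the 256 buckets obj[0] alone allows.
        return BitConverter.ToInt32(obj, 0);
    }
}
```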


New code, thanks to @spender:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Text;
using System.Security.Cryptography;

namespace DupeFinder
{
    class Program
    {
        static void Main(string[] args)
        {
            var watch = Stopwatch.StartNew();
            const string dir = @"D:\Photos";
            var md5Comparer = new Md5Comparer();

            var dupeGroups = Directory.EnumerateFiles(dir, "*", SearchOption.AllDirectories)
                .Select(fn => new FileInfo(fn))
                .GroupBy(fi => fi.Length)
                .Where(g => g.Count() > 1)
                .SelectMany(g => g
                   .GroupBy(fi => GetHash(fi.FullName), md5Comparer)
                   .Where(g2 => g2.Count() > 1));

            using (var sw = new StreamWriter(Path.Combine(dir, "duplicates.txt")))
            {
                int i = 0;
                foreach (var dupeGroup in dupeGroups)
                {
                    sw.WriteLine("Duplicate {0}", ++i);
                    foreach(FileInfo fi in dupeGroup)
                    {
                        sw.WriteLine("- {0}", fi.FullName);
                    }
                }
            }

            Console.WriteLine("{0:0.000} seconds", watch.ElapsedMilliseconds / 1000d); // 22.068 seconds to process 10K files, 37 GB, 463 dupes
            Console.ReadLine();
        }

        static readonly HMACMD5 md5Hasher = new HMACMD5();
        public static byte[] GetHash(string fileName)
        {
            using(var stream = File.OpenRead(fileName))
                return md5Hasher.ComputeHash(stream);
        }
    }

    class Md5Comparer : IEqualityComparer<byte[]>
    {
        public bool Equals(byte[] x, byte[] y)
        {
            var xi = BitConverter.ToInt64(x, 0);
            var yi = BitConverter.ToInt64(y, 0);
            if (xi != yi) return false;
            xi = BitConverter.ToInt64(x, 8);
            yi = BitConverter.ToInt64(y, 8);
            return xi == yi;
        }

        public int GetHashCode(byte[] obj)
        {
            return obj[0];
        }
    }
}

Down from 360+ seconds to 22–70 seconds. A pretty good improvement!

spe*_*der 5

You're missing a huge optimization: if the sizes don't match...

Also, a matching hash doesn't guarantee matching content.
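To be certain, a hash match would be followed up with a byte-by-byte comparison. A minimal sketch of what that could look like (the `FilesAreEqual` name is mine, not part of the code in this thread):

```csharp
using System.IO;

static class FileCompare
{
    // Returns true only if both files have identical contents.
    public static bool FilesAreEqual(string path1, string path2)
    {
        var info1 = new FileInfo(path1);
        var info2 = new FileInfo(path2);
        if (info1.Length != info2.Length) return false; // cheap pre-check

        using (var s1 = File.OpenRead(path1))
        using (var s2 = File.OpenRead(path2))
        {
            int b1, b2;
            do
            {
                // FileStream buffers internally, so ReadByte is not
                // one syscall per byte.
                b1 = s1.ReadByte();
                b2 = s2.ReadByte();
                if (b1 != b2) return false;
            } while (b1 != -1); // -1 means end of stream
            return true;
        }
    }
}
```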

EDIT

Okay. Finding size-based dupe candidates isn't hard, and then you can check whether they're actually dupes:

For example:

var files = 
    Directory
        .GetFiles(
            @"C:\Users\Administrator\Downloads",
            "*", 
            SearchOption.AllDirectories);
var fileGroups=
    files
        .Select(fn => new FileInfo(fn))
        .GroupBy(fi => fi.Length)
        .Where(g => g.Count()>1);

You could take this further by creating a hash function for a given file name:

int GetHash(string fileName)  // fill in your definition

Then...

var fg2 =
        fileGroups
            .SelectMany(
                g => g
                    .GroupBy(fi => GetHash(fi.FullName))
                    .Where(gg=>gg.Count()>1));

foreach(var dupeGroup in fg2)
{
    //everything in this group is probably duplicate
    //although you'd need to do byte by byte to be sure
    foreach(FileInfo dupe in dupeGroup)
    {

    }
}

By doing this, you will drastically reduce the amount of hashing required, because you've pre-filtered the candidates by size.
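A further refinement along the same lines (my own sketch, not from the answer): before hashing whole files, hash only the first few kilobytes of each candidate, and compute full hashes only for files whose prefixes collide. The 64 KB prefix size here is an arbitrary choice:

```csharp
using System.IO;
using System.Security.Cryptography;

static class PrefixHash
{
    // Hash only the first prefixBytes of the file; large files that
    // differ early are rejected without reading them in full.
    public static byte[] GetPrefixHash(string fileName, int prefixBytes = 64 * 1024)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(fileName))
        {
            var buffer = new byte[prefixBytes];
            int read = stream.Read(buffer, 0, buffer.Length);
            return md5.ComputeHash(buffer, 0, read);
        }
    }
}
```

Files would then go through three filters: same size, same prefix hash, and finally same full hash (or a byte-by-byte check) before being reported as duplicates.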

  • Essentially, you want to go over all the files and only consider those of identical size. So build a dictionary keyed on size, and then hash only the files that share a size; that will give a dramatic speedup. You could also check a few other file-specific things (headers, etc.) before actually doing the hash. Then, if the hashes check out identical, go byte-by-byte on the files. That's the idea. (2 upvotes)