Fastest way in .NET to find all files matching a pattern in all directories

Dou*_*ain 5 vb.net

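I have 500K *.ax5 files that I have to process and export to another format. Because of the sheer number of files, and because Windows performs poorly with too many files in one folder, they are buried in subfolders alongside other files with different extensions. In C#, what is the fastest way to find every such file at any level of subfolder under C:\Sketch?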

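The folder structure is always the same: AAAA\BB\CCCC_BLD [a bunch of different file types]. After the initial run, I would also like to process only the files whose write date is later than the date of the last run.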

Alternatively, how can I quickly get the number of records so that I can show the percentage processed?

I cannot change the source structure of the files/folders; it is set by the vendor.

Here is what I have. I have tried both Array.ForEach and Parallel.ForEach; both seem very slow.

Sub walkTree(ByVal directory As DirectoryInfo, ByVal pattern As String)
    Array.ForEach(directory.EnumerateFiles(pattern).ToArray(), Sub(fileInfo)
                                                                   Export(fileInfo)
                                                               End Sub)
    For Each subDir In directory.EnumerateDirectories()
        walkTree(subDir, pattern)    
    Next
End Sub

Jen*_*ens 7

http://msdn.microsoft.com/en-us/library/ms143316(v=vs.110).aspx

Directory.GetFiles(@"C:\Sketch", "*.ax5", SearchOption.AllDirectories);

Might that work for you?


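As for performance, I doubt you will find a much faster way of scanning directories because, as @Mathew Foscarini points out, your disks are the bottleneck here.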

If the directory is indexed, then using the index would be faster, as @jaccus mentions.


I took some time to benchmark things. It actually does seem that you can get about a 33% performance gain by collecting the files asynchronously.

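The test set I ran on might not match your situation; I don't know how deeply your files are nested, and so on. What I did was create 5000 random files in each directory on each level (although I settled for a single level) and 100 directories, for a total of 505,000 files.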

I tested three methods of collecting the files...

The simple approach.

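using System.Collections.Generic;
using System.IO;

public class SimpleFileCollector
{
    // Lets the framework walk the whole tree in a single call.
    public List<string> CollectFiles(DirectoryInfo directory, string pattern)
    {
        return new List<string>(Directory.GetFiles(directory.FullName, pattern, SearchOption.AllDirectories));
    }
}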

The "dumb" approach, although it is only dumb if you know about the overload used in the simple approach... Otherwise it is a perfectly fine solution.

using System.Collections.Generic;
using System.IO;
using System.Linq;

public class DumbFileCollector
{
    // Recurses by hand: files of this directory first, then each subdirectory.
    public List<string> CollectFiles(DirectoryInfo directory, string pattern)
    {
        List<string> files = new List<string>(500000);
        files.AddRange(directory.GetFiles(pattern).Select(file => file.FullName));

        foreach (DirectoryInfo dir in directory.GetDirectories())
        {
            files.AddRange(CollectFiles(dir, pattern));
        }
        return files;
    }
}

The Task API approach...

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

public class ThreadedFileCollector
{
    public List<string> CollectFiles(DirectoryInfo directory, string pattern)
    {
        // Thread-safe queue shared by all tasks.
        ConcurrentQueue<string> queue = new ConcurrentQueue<string>();
        InternalCollectFiles(directory, pattern, queue);
        return queue.AsEnumerable().ToList();
    }

    private void InternalCollectFiles(DirectoryInfo directory, string pattern, ConcurrentQueue<string> queue)
    {
        foreach (string result in directory.GetFiles(pattern).Select(file => file.FullName))
        {
            queue.Enqueue(result);
        }

        // One task per subdirectory; wait until every subtree has been scanned.
        Task.WaitAll(directory
            .GetDirectories()
            .Select(dir => Task.Factory.StartNew(() => InternalCollectFiles(dir, pattern, queue))).ToArray());
    }
}

This is only a test of collecting all the files, not of processing them. For the processing, kicking off threads would make sense.
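For the processing step, together with the asker's wish to skip files unchanged since the last run and to show a percentage, a minimal sketch could look like the following. It was not part of the benchmark. The names `IncrementalProcessor`, `CollectChanged`, `ProcessAll`, `lastRunUtc`, `export` and `reportPercent` are all hypothetical, and `export` merely stands in for whatever the real conversion routine is.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class IncrementalProcessor
{
    // Hypothetical helper: walks the tree lazily and keeps only the files
    // written after the previous run.
    public static List<string> CollectChanged(DirectoryInfo root, string pattern, DateTime lastRunUtc)
    {
        return root.EnumerateFiles(pattern, SearchOption.AllDirectories)
                   .Where(file => file.LastWriteTimeUtc > lastRunUtc)
                   .Select(file => file.FullName)
                   .ToList();
    }

    // Hypothetical helper: processes the files in parallel and reports a whole
    // percentage after each file. `export` is a placeholder for the real work.
    // Returns the number of files processed.
    public static int ProcessAll(IList<string> files, Action<string> export, Action<int> reportPercent)
    {
        int done = 0;
        Parallel.ForEach(files, path =>
        {
            export(path);
            // Interlocked keeps the counter correct across worker threads.
            int finished = Interlocked.Increment(ref done);
            reportPercent(finished * 100 / files.Count);
        });
        return done;
    }
}
```

Because the list is collected before processing starts, its `Count` gives the total needed for the percentage. `reportPercent` is called from worker threads, so a UI would have to marshal the call back to its own thread.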

Here are the results on my system:

Simple Collector:
 - Pass 0: found 505000 files in 2847 ms
 - Pass 1: found 505000 files in 2865 ms
 - Pass 2: found 505000 files in 2860 ms
 - Pass 3: found 505000 files in 3061 ms
 - Pass 4: found 505000 files in 3006 ms
 - Pass 5: found 505000 files in 2807 ms
 - Pass 6: found 505000 files in 2849 ms
 - Pass 7: found 505000 files in 2789 ms
 - Pass 8: found 505000 files in 2790 ms
 - Pass 9: found 505000 files in 2788 ms
Average: 2866 ms

Dumb Collector:
 - Pass 0: found 505000 files in 5190 ms
 - Pass 1: found 505000 files in 5204 ms
 - Pass 2: found 505000 files in 5453 ms
 - Pass 3: found 505000 files in 5311 ms
 - Pass 4: found 505000 files in 5339 ms
 - Pass 5: found 505000 files in 5362 ms
 - Pass 6: found 505000 files in 5316 ms
 - Pass 7: found 505000 files in 5319 ms
 - Pass 8: found 505000 files in 5583 ms
 - Pass 9: found 505000 files in 5197 ms
Average: 5327 ms

Threaded Collector:
 - Pass 0: found 505000 files in 2152 ms
 - Pass 1: found 505000 files in 2102 ms
 - Pass 2: found 505000 files in 2022 ms
 - Pass 3: found 505000 files in 2030 ms
 - Pass 4: found 505000 files in 2075 ms
 - Pass 5: found 505000 files in 2120 ms
 - Pass 6: found 505000 files in 2030 ms
 - Pass 7: found 505000 files in 1980 ms
 - Pass 8: found 505000 files in 1993 ms
 - Pass 9: found 505000 files in 2120 ms
Average: 2062 ms
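The answer does not show the timing harness itself. A minimal sketch that produces output in the same shape as the results above, assuming a `Stopwatch`-based loop, might be the following. `CollectorBenchmark` and `Run` are hypothetical names, and this is not necessarily how the numbers above were measured.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;

public static class CollectorBenchmark
{
    // Runs `collect` the given number of times, writes one line per pass,
    // and returns the average elapsed time in whole milliseconds.
    public static long Run(string name, Func<List<string>> collect, int passes, TextWriter output)
    {
        if (passes <= 0) throw new ArgumentOutOfRangeException(nameof(passes));

        output.WriteLine(name + ":");
        long totalMs = 0;
        for (int pass = 0; pass < passes; pass++)
        {
            Stopwatch watch = Stopwatch.StartNew();
            List<string> files = collect();
            watch.Stop();

            totalMs += watch.ElapsedMilliseconds;
            output.WriteLine(" - Pass {0}: found {1} files in {2} ms", pass, files.Count, watch.ElapsedMilliseconds);
        }

        long average = totalMs / passes;
        output.WriteLine("Average: {0} ms", average);
        return average;
    }
}
```

A call such as `CollectorBenchmark.Run("Simple Collector", () => new SimpleFileCollector().CollectFiles(new DirectoryInfo(@"C:\Sketch"), "*.ax5"), 10, Console.Out)` would time the simple collector. Keep in mind that after the first pass the operating system's file cache is warm, so repeated passes mostly measure cached directory reads.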

As a side note, @Konrad Kokosa suggested blocking for each directory to make sure you don't kick off millions of threads. Don't do that...

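There is no reason for you to manage how many threads are active at a given time. Let the Task framework's standard scheduler handle it; it will do a much better job of balancing the number of threads against the number of cores you have...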

If you really want to control it yourself, just because, implementing a custom scheduler would be a better option: http://msdn.microsoft.com/en-us/library/system.threading.tasks.taskscheduler(v=vs.110).aspx
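If a cap on concurrency is wanted anyway, a lighter-weight alternative to writing a custom TaskScheduler is `ParallelOptions.MaxDegreeOfParallelism`. This swaps in a different technique from the one benchmarked above and has not been measured here. `BoundedCollector` and `maxParallelism` are hypothetical names, and the sketch only parallelizes across the top-level subdirectories.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

public static class BoundedCollector
{
    // Collects matching files, scanning the top-level subdirectories in
    // parallel with at most `maxParallelism` concurrent operations.
    // `maxParallelism` must be positive, or -1 for "no limit".
    public static List<string> CollectFiles(DirectoryInfo root, string pattern, int maxParallelism)
    {
        ConcurrentQueue<string> found = new ConcurrentQueue<string>();
        ParallelOptions options = new ParallelOptions { MaxDegreeOfParallelism = maxParallelism };

        // Files that sit directly in the root directory.
        foreach (FileInfo file in root.EnumerateFiles(pattern))
        {
            found.Enqueue(file.FullName);
        }

        // Each top-level subdirectory is walked recursively by the framework.
        Parallel.ForEach(root.EnumerateDirectories(), options, dir =>
        {
            foreach (FileInfo file in dir.EnumerateFiles(pattern, SearchOption.AllDirectories))
            {
                found.Enqueue(file.FullName);
            }
        });

        return new List<string>(found);
    }
}
```

With a fixed AAAA\BB\CCCC_BLD layout, splitting the work at the top level is a reasonable unit, but whether any cap helps depends on the disk rather than the CPU, as noted earlier.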