Fast Linux file count for a large number of files

119 linux shell disk-io

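I'm trying to figure out the best way to find the number of files in a particular directory when there are a very large number of files (> 100,000).

When there are that many files, running "ls | wc -l" takes quite a long time. I believe this is because it returns the names of all the files. I'm trying to use as little disk I/O as possible.

I have experimented with some shell and Perl scripts to no avail. Any ideas?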

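By default, ls sorts the names, which can take a while if there are a lot of them. It also produces no output until all of the names have been read and sorted. Use the ls -f option to turn off sorting.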

ls -f | wc -l

Note that this also enables -a, so `.`, `..`, and other files whose names start with `.` will be counted.
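Since `ls -f` always lists `.` and `..`, the raw line count is two higher than the number of entries. The following is a minimal sketch of a wrapper that corrects for this. The function name `count_entries` is made up for illustration, and the sketch assumes that no file name contains a newline.

```shell
#!/bin/sh
# count_entries DIR: hypothetical helper, not part of ls.
# Prints the number of entries in DIR, hidden files included.
# Assumption: no file name in DIR contains a newline.
count_entries() {
    # ls -f does not sort and implies -a, so "." and ".." are listed too
    n=$(ls -f -- "$1" | wc -l)
    # subtract the two lines for "." and ".."
    echo $((n - 2))
}

# Example: count the entries of the current directory
count_entries .
```

Hidden files are still included in the result; only the two directory placeholders are removed.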

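`ls -f` doesn't `stat()` either. Of course, both `ls` and `find` do call `stat()` when certain options are used, such as `ls -l` or `find -mtime`. (12 upvotes)
+1 And I thought I knew everything there was to know about `ls`. (10 upvotes)
For context, this took 1-2 minutes to count 2.5 million jpgs on a smallish Slicehost box. (7 upvotes)
ZOMG. Sorting 100K lines is nothing compared to the `stat()` call `ls` does on every file. `find` doesn't `stat()`, so it works faster. (5 upvotes)
If you want to add subdirectories to the count, do `ls -fR | wc -l` (5 upvotes)
@BryanP: `-1` is the default when stdout is not a terminal (in this case, ls's stdout is a pipe) (3 upvotes)
Easy to remember: `ls -f1 | wc -l` # F1 is very fast (3 upvotes)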

The fastest way is a purpose-built program, like this:

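#include <stdio.h>
#include <dirent.h>

int main(int argc, char *argv[]) {
    DIR *dir;
    struct dirent *ent;
    long count = 0;

    /* A directory argument is required */
    if(argc < 2) {
        fprintf(stderr, "usage: %s directory\n", argv[0]);
        return 1;
    }

    dir = opendir(argv[1]);

    /* opendir fails if the path doesn't exist or isn't a directory */
    if(NULL == dir) {
        perror(argv[1]);
        return 1;
    }

    /* Counts every entry, including "." and ".." */
    while((ent = readdir(dir)))
            ++count;

    closedir(dir);

    printf("%s contains %ld files\n", argv[1], count);

    return 0;
}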

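My testing did not try to account for the cache. Instead, I ran each of these about 50 times against the same directory, over and over, to avoid cache-based data skew, and I got roughly the following performance numbers (in real clock time):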

ls -1  | wc - 0:01.67
ls -f1 | wc - 0:00.14
find   | wc - 0:00.22
dircnt | wc - 0:00.04

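The last one, dircnt, is the program compiled from the source above.

EDIT 2016-09-26

Due to popular demand, I've rewritten this program to be recursive, so it will drop into subdirectories and continue to count files and directories separately.

Since it's clear some folks want to know how to do all this, I have put a lot of comments in the code to try to make it obvious what's going on. I wrote this and tested it on 64-bit Linux, but it should work on any POSIX-compliant system, including Microsoft Windows. Bug reports are welcome; I'm happy to update this if you can't get it working on your AIX or OS/400 or whatever.

As you can see, it's much more complicated than the original, and necessarily so: at least one function must exist to be called recursively unless you want the code to become very complex (e.g. managing a subdirectory stack and processing it in a single loop). Since we have to check file types, differences between operating systems, standard libraries, etc. come into play, so I have written a program that tries to be usable on any system where it will compile.

There is very little error checking, and the count function itself doesn't really report errors. The only calls that can really fail are opendir and stat (if you aren't lucky enough to have a system where dirent already contains the file type). I'm not paranoid about checking the total length of the subdirectory path names, but theoretically the system shouldn't allow any path name longer than PATH_MAX. If there are concerns, I can fix that, but it's just more code that needs to be explained to someone learning to write C. This program is intended to be an example of how to dive into subdirectories recursively.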

#include <stdio.h>
#include <dirent.h>
#include <string.h>
#include <stdlib.h>
#include <limits.h>
#include <sys/stat.h>

#if defined(WIN32) || defined(_WIN32) 
#define PATH_SEPARATOR '\\' 
#else
#define PATH_SEPARATOR '/' 
#endif

/* A custom structure to hold separate file and directory counts */
struct filecount {
  long dirs;
  long files;
};

/*
 * counts the number of files and directories in the specified directory.
 *
 * path - relative pathname of a directory whose files should be counted
 * counts - pointer to struct containing file/dir counts
 */
void count(char *path, struct filecount *counts) {
    DIR *dir;                /* dir structure we are reading */
    struct dirent *ent;      /* directory entry currently being processed */
    char subpath[PATH_MAX];  /* buffer for building complete subdir and file names */
    /* Some systems don't have dirent.d_type field; we'll have to use stat() instead */
#if !defined ( _DIRENT_HAVE_D_TYPE )
    struct stat statbuf;     /* buffer for stat() info */
#endif

/* fprintf(stderr, "Opening dir %s\n", path); */
    dir = opendir(path);

    /* opendir failed... file likely doesn't exist or isn't a directory */
    if(NULL == dir) {
        perror(path);
        return;
    }

    while((ent = readdir(dir))) {
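      /* ">=" leaves room for the terminating NUL that sprintf() writes into subpath */
      if (strlen(path) + 1 + strlen(ent->d_name) >= PATH_MAX) {
          fprintf(stderr, "path too long (%lu) %s%c%s\n", (unsigned long)(strlen(path) + 1 + strlen(ent->d_name)), path, PATH_SEPARATOR, ent->d_name);
          closedir(dir);
          return;
      }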

/* Use dirent.d_type if present, otherwise use stat() */
#if defined ( _DIRENT_HAVE_D_TYPE )
/* fprintf(stderr, "Using dirent.d_type\n"); */
      if(DT_DIR == ent->d_type) {
#else
/* fprintf(stderr, "Don't have dirent.d_type, falling back to using stat()\n"); */
      sprintf(subpath, "%s%c%s", path, PATH_SEPARATOR, ent->d_name);
      if(lstat(subpath, &statbuf)) {
          perror(subpath);
          closedir(dir);   /* don't leak the directory handle on early return */
          return;
      }

      if(S_ISDIR(statbuf.st_mode)) {
#endif
          /* Skip "." and ".." directory entries... they are not "real" directories */
          if(0 == strcmp("..", ent->d_name) || 0 == strcmp(".", ent->d_name)) {
/*              fprintf(stderr, "This is %s, skipping\n", ent->d_name); */
          } else {
              sprintf(subpath, "%s%c%s", path, PATH_SEPARATOR, ent->d_name);
              counts->dirs++;
              count(subpath, counts);
          }
      } else {
          counts->files++;
      }
    }

/* fprintf(stderr, "Closing dir %s\n", path); */
    closedir(dir);
}

int main(int argc, char *argv[]) {
    struct filecount counts;
    counts.files = 0;
    counts.dirs = 0;
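    /* Without a directory argument argv[1] is NULL, so bail out early */
    if(argc < 2) {
        fprintf(stderr, "usage: %s directory\n", argv[0]);
        return 1;
    }
    count(argv[1], &counts);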

    /* If we found nothing, this is probably an error which has already been printed */
    if(0 < counts.files || 0 < counts.dirs) {
        printf("%s contains %ld files and %ld directories\n", argv[1], counts.files, counts.dirs);
    }

    return 0;
}

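EDIT 2017-01-17

I've incorporated two changes suggested by @FlyingCodeMonkey:

Use lstat instead of stat. This changes the behavior of the program if the directory you are scanning contains symlinked directories. Previously, a (linked) subdirectory had its file count added to the overall count; now the linked directory counts as a single file, and its contents are not counted.
If the path of a file is too long, an error message is emitted and the program halts.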
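The difference between the two symlink behaviors can also be seen from the shell. The sketch below is only an illustration using `find`, not the C program itself: without `-L`, `find` does not follow symbolic links (comparable to lstat), while `find -L` follows them (comparable to stat). The function names `count_nofollow` and `count_follow` are hypothetical, and file names are assumed not to contain newlines.

```shell
#!/bin/sh
# Hypothetical helpers; both count everything that is not a directory,
# which mirrors what the C program above calls "files".
# Assumption: no file name contains a newline.

# Symlinks are not followed: a symlinked directory counts as one entry
count_nofollow() {
    echo $(( $(find "$1" ! -type d | wc -l) ))
}

# Symlinks are followed: the contents of a symlinked directory are counted
count_follow() {
    echo $(( $(find -L "$1" ! -type d | wc -l) ))
}
```

For a directory holding one regular file and one symlink to a directory with two files, `count_nofollow` prints 2 and `count_follow` prints 3.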

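EDIT 2017-06-29

With any luck, this will be the last edit of this answer :)

I've copied this code into a GitHub repository to make it a bit easier to get the code (instead of copy/paste, you can just download the source), and it also makes it easier for anyone to suggest a modification by submitting a pull request on GitHub.

The source is available under Apache License 2.0. Patches* welcome!

* "Patch" is what old people like me call a "pull request".

Awesome! Thanks! For those who don't know: you can compile the above code in a terminal with `gcc -o dircnt dircnt.c` and use it like this: `./dircnt some_dir` (2 upvotes)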

Did you try find? For example:

find . -name "*.ext" | wc -l

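If he only wants the current directory, not the whole tree recursively, he can add the -maxdepth 1 option to find. (11 upvotes)
It seems the reason `find` is faster than `ls` is because of how you are using `ls`. If you stop sorting, `ls` and `find` have similar performance. (3 upvotes)
You can speed up find + wc by printing only a single character: `find . -printf x | wc -c`. Otherwise you are building strings from the entire path and passing them to wc (extra I/O). (3 upvotes)
You should use `-printf` as shown by @ives anyway, so that the count is correct when some joker creates file names with newlines in them. (2 upvotes)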
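The suggestions in these comments can be combined into one small function. This is a sketch that assumes GNU find (for `-printf`, `-mindepth` and `-maxdepth`); the name `count_dir_entries` is made up for illustration. Because one character is printed per entry and `wc -c` counts characters rather than lines, file names containing newlines do not distort the result.

```shell
#!/bin/sh
# count_dir_entries DIR: hypothetical helper, assumes GNU find.
# Prints the number of entries directly inside DIR, hidden ones included.
count_dir_entries() {
    # -mindepth 1 leaves out DIR itself; -maxdepth 1 prevents recursion.
    # -printf x emits a single "x" per entry instead of the full path.
    echo $(( $(find "$1" -mindepth 1 -maxdepth 1 -printf x | wc -c) ))
}

# Example: count the entries of the current directory
count_dir_entries .
```

Dropping `-maxdepth 1` would count the whole tree, and adding `-type f` would restrict the count to regular files.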

find, ls and perl tested against 40,000 files: same speed (though I didn't try to clear the cache):

[user@server logs]$ time find . | wc -l
42917

real    0m0.054s
user    0m0.018s
sys     0m0.040s
[user@server logs]$ time /bin/ls -f | wc -l
42918

real    0m0.059s
user    0m0.027s
sys     0m0.037s

And with perl opendir/readdir, the same time:

[user@server logs]$ time perl -e 'opendir D, "."; @files = readdir D; closedir D; print scalar(@files)."\n"'
42918

real    0m0.057s
user    0m0.024s
sys     0m0.033s

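Note: I used /bin/ls -f to make sure any alias options are bypassed, since they might slow it down a little, and -f to avoid sorting the file names. ls without -f is about twice as slow as find/perl, whereas ls with -f seems to take the same time: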

[user@server logs]$ time /bin/ls . | wc -l
42916

real    0m0.109s
user    0m0.070s
sys     0m0.044s

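I would also like to have some script that asks the file system directly, without all the unnecessary information.

The tests are based on the answers of Peter van der Heijden, glenn jackman, and mark4o.

Thomas

You should definitely clear the cache between tests. The first time I ran `ls -l | wc -l` on a folder with 1M files on an external 2.5" HDD, it took about 3 minutes for the operation to finish. The second time it took 12 seconds, IIRC. This could also depend on your file system. I was using `Btrfs`. (5 upvotes)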

Surprisingly for me, a bare-bones find is very much comparable to ls -f

> time ls -f my_dir | wc -l
17626

real    0m0.015s
user    0m0.011s
sys     0m0.009s

versus

> time find my_dir -maxdepth 1 | wc -l
17625

real    0m0.014s
user    0m0.008s
sys     0m0.010s

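Of course, the values in the third decimal place shift around a bit every time you run either of these, so they are basically identical. Note, however, that find returns one extra unit, because it counts the directory itself (and, as mentioned before, ls -f returns two extra units, since it also counts . and ..).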

ls spends more time sorting the file names. Using -f disables the sorting, which saves some time:

ls -f | wc -l

Or you can use find:

find . -type f | wc -l

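You can change the output to suit your requirements, but here is a Bash one-liner I wrote to recursively count and report the number of files in a series of numerically named directories.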

dir=/tmp/count_these/ ; for i in $(ls -1 "${dir}" | sort -n) ; do echo "$i => $(find "${dir}${i}" -type f | wc -l),"; done

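This looks recursively for all files (not directories) in the given directory and returns the results in a hash-like format. Simple tweaks to the find command can make the kind of files you count more specific, and so on.

The result looks something like this: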

1 => 38,
65 => 95052,
66 => 12823,
67 => 10572,
69 => 67275,
70 => 8105,
71 => 42052,
72 => 1184,
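A variant of the same idea that does not parse the output of ls is sketched below. It is not the author's one-liner: it assumes GNU find (for `-printf`), the function name `count_per_subdir` is made up for illustration, and it prints no trailing commas. Hidden subdirectories are skipped because the `*/` glob does not match names that start with a dot.

```shell
#!/bin/sh
# count_per_subdir DIR: hypothetical helper, assumes GNU find.
# Prints "name => count" for every visible subdirectory of DIR,
# where count is the number of regular files below it (recursively).
count_per_subdir() {
    for sub in "$1"/*/; do
        # the glob stays literal when DIR has no subdirectories
        [ -d "$sub" ] || continue
        name=$(basename "$sub")
        # one "x" per regular file, so wc -c yields the file count
        n=$(( $(find "$sub" -type f -printf x | wc -c) ))
        printf '%s => %s\n' "$name" "$n"
    done | sort -n
}

# Example: report on the subdirectories of the current directory
count_per_subdir .
```

The final `sort -n` orders the lines by the leading number, so it matches the original ordering only for numerically named directories.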

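I found this example a little confusing. I was wondering why there were numbers on the left instead of directory names. Thank you for this though, I ended up using it with a few minor tweaks (counting directories and dropping the base folder name): `for i in $(ls -1 . | sort -n) ; { echo "$i => $(find ${i} | wc -l)"; }` (2 upvotes)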

Fast Linux file count

The fastest Linux file count I know of is

locate -c -r '/home'

There is no need to invoke grep! But as mentioned, you need a fresh database (updated daily by a cron job, or manually with sudo updatedb).

From man locate

-c, --count
    Instead  of  writing  file  names on standard output, write the number of matching
    entries only.

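Additionally, you should know that it also counts directories as files!

BTW: if you want an overview of the files and directories on your system, type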

locate -S

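It outputs the number of directories, files, etc.

LOL, if you already have all the counts in a database, then of course you can count quickly. :) (2 upvotes)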

Asked:	16 years, 1 month ago
Viewed:	85364 times
Last active:	6 years, 7 months ago
