How do I count the number of files under a specific directory in Hadoop?

Question

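I'm new to the map-reduce framework. I want to find the number of files under a specific directory by providing that directory's name. For example, suppose we have 3 directories A, B, and C, containing 20, 30, and 40 part-r files respectively. I'd like to write a Hadoop job that counts the files/records in each directory, i.e. I want the output in a .txt file formatted as below: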

A has 20 records

B has 30 records

C has 40 records

All of these directories are in HDFS.

Answer 1

Pet*_*tro 6

The simplest, native approach is to use the built-in hdfs command, in this case -count:

hdfs dfs -count /path/to/your/dir  >> output.txt
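
To get the per-directory report the question asks for, the -count output can be reshaped with awk. This is a minimal sketch, assuming the default -count column layout (DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME) and paths without spaces. The function name `format_counts` and the paths `/data/A`, `/data/B`, `/data/C` are hypothetical. Note that FILE_COUNT is recursive, so files in subdirectories are included.

```shell
#!/usr/bin/env bash
# format_counts: read `hdfs dfs -count` output on stdin and print
# "<directory name> has <file count> records" for each input line.
# Assumes the default columns: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME.
format_counts() {
    awk '{
        n = split($4, parts, "/")   # split the path; the last piece is the directory name
        print parts[n] " has " $2 " records"
    }'
}

# Example call against a cluster (paths are hypothetical):
#   hdfs dfs -count /data/A /data/B /data/C | format_counts > output.txt
```

Keeping the formatting in a function that reads stdin means it can be tried on saved -count output without access to a cluster.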

Or, if you prefer a hybrid approach that mixes in Linux commands:

hadoop fs -ls /path/to/your/dir/*  | wc -l >> output.txt
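
One caveat with counting lines: when -ls lists a directory itself (rather than individual files matched by a glob), it prints a "Found N items" header, and subdirectories show up as entries too, so the line count can overstate the number of files. A hedged sketch that counts only regular-file entries, i.e. lines whose permission string starts with `-`, is below. The function name `count_files` is hypothetical.

```shell
#!/usr/bin/env bash
# count_files: read `hadoop fs -ls` output on stdin and print how many
# entries are regular files (permission string begins with "-").
# Directory entries ("d...") and the "Found N items" header are skipped.
count_files() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # the count is 0, so `|| true` keeps that from counting as a failure.
    grep -c '^-' || true
}

# Example call against a cluster:
#   hadoop fs -ls /path/to/your/dir | count_files >> output.txt
```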

Finally, the MapReduce version has already been answered here:

How do I count the number of files in HDFS from an MR job?

Code:

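import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

// Fragment: place these statements in a method of a class that extends
// Configured (e.g. a Tool), so that getConf() is available.
int count = 0;
FileSystem fs = FileSystem.get(getConf());
boolean recursive = false; // set to true to also count files in subdirectories
RemoteIterator<LocatedFileStatus> ri = fs.listFiles(new Path("hdfs://my/path"), recursive);
while (ri.hasNext()) {
    ri.next(); // listFiles returns files only, never directories
    count++;
}
System.out.println("The count is: " + count);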

Archived:	9 years, 5 months ago
Views:	7,393
Last activity:	8 years, 1 month ago