Filtering input files with globStatus in MapReduce

aa8*_*a8y 2 java hadoop mapreduce cloudera

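I have a lot of input files and want to process only selected ones, based on the date appended to the end of each file name. I'm confused about where I should call the globStatus method to filter the files.

I have a custom RecordReader class and tried calling globStatus in its next method, but that didn't work.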

public boolean next(Text key, Text value) throws IOException {
    Path filePath = fileSplit.getPath();

    if (!processed) {
        key.set(filePath.getName());

        byte[] contents = new byte[(int) fileSplit.getLength()];
        value.clear();
        FileSystem fs = filePath.getFileSystem(conf);
        fs.globStatus(new Path("/*" + date));
        FSDataInputStream in = null;

        try {
            in = fs.open(filePath);
            IOUtils.readFully(in, contents, 0, contents.length);
            value.set(contents, 0, contents.length);
        } finally {
            IOUtils.closeStream(in);
        }
        processed = true;
        return true;
    }
    return false;
}

I know it returns an array of FileStatus, but how do I use that to filter the files? Can someone explain?

Cha*_*guy 10

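The globStatus method takes two complementary arguments that let you filter your files. The first is a glob pattern. When a glob pattern is not expressive enough to select specific files, you can also pass a PathFilter as the second argument.

The following glob patterns are supported: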

Glob   | Matches
-------------------------------------------------------------------------------------------------------------------
*      | Matches zero or more characters
?      | Matches a single character
[ab]   | Matches a single character in the set {a, b}
[^ab]  | Matches a single character not in the set {a, b}
[a-b]  | Matches a single character in the range [a, b] where a is lexicographically less than or equal to b
[^a-b] | Matches a single character not in the range [a, b] where a is lexicographically less than or equal to b
{a,b}  | Matches either expression a or b
\c     | Matches character c when it is a metacharacter
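
To make the table concrete, here is a minimal sketch that translates this glob subset into a java.util.regex pattern and matches it against a single file name. It is not Hadoop's implementation: GlobSketch, toRegex and matches are hypothetical names, it uses only the JDK, and it assumes the name contains no path separators and that every [ has a closing ].

```java
import java.util.regex.Pattern;

/** Hypothetical helper: turns the glob subset from the table into a java.util.regex.Pattern. */
final class GlobSketch {
    private GlobSketch() {}

    static Pattern toRegex(String glob) {
        StringBuilder regex = new StringBuilder();
        boolean inSet = false; // true while inside [...]
        int braceDepth = 0;    // nesting level of {...}
        for (int i = 0; i < glob.length(); i++) {
            char c = glob.charAt(i);
            if (inSet) {
                if (c == ']') {
                    inSet = false;
                    regex.append(']');
                } else if (c == '\\' || c == '[' || c == '&') {
                    regex.append('\\').append(c); // neutralize regex-only syntax
                } else {
                    regex.append(c);              // keeps ^ and a-b ranges as written
                }
                continue;
            }
            switch (c) {
                case '*': regex.append(".*"); break;            // zero or more characters
                case '?': regex.append('.'); break;             // exactly one character
                case '[': inSet = true; regex.append('['); break;
                case '{': braceDepth++; regex.append("(?:"); break;
                case '}':
                    if (braceDepth > 0) { braceDepth--; regex.append(')'); }
                    else { regex.append("\\}"); }
                    break;
                case ',':
                    regex.append(braceDepth > 0 ? "|" : ","); // alternation only inside {...}
                    break;
                case '\\':
                    // \c matches the next character literally
                    if (i + 1 < glob.length()) {
                        regex.append(Pattern.quote(String.valueOf(glob.charAt(++i))));
                    } else {
                        regex.append("\\\\");
                    }
                    break;
                default:
                    regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    /** Returns true when the whole name matches the glob. */
    static boolean matches(String glob, String name) {
        return toRegex(glob).matcher(name).matches();
    }
}
```

For the question's case, a pattern such as "*2013-01-15" would accept any name that ends with that date, which is roughly what new Path("/*" + date) expresses for files directly under the root directory.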

PathFilter is simply an interface like this:

public interface PathFilter {
    boolean accept(Path path);
}

So you can implement this interface and put your file-filtering logic in its accept method.
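
As a rough illustration of how such a filter narrows a listing, the sketch below mimics the idea with plain JDK types so it runs without Hadoop on the classpath. NameFilter is a stand-in for PathFilter that takes a String instead of a Path, and DateSuffixFilter and FilterSketch are hypothetical names. Only the entries for which accept returns true are kept, which is also how the array returned by globStatus(pattern, filter) ends up containing just the matching files.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for Hadoop's PathFilter, using String instead of Path (assumption for this sketch). */
interface NameFilter {
    boolean accept(String path);
}

/** Hypothetical filter that keeps paths ending with a given date suffix. */
final class DateSuffixFilter implements NameFilter {
    private final String date;

    DateSuffixFilter(String date) {
        this.date = date;
    }

    @Override
    public boolean accept(String path) {
        return path.endsWith(date);
    }
}

final class FilterSketch {
    private FilterSketch() {}

    /** Returns the candidates the filter accepts, in their original order. */
    static List<String> select(List<String> candidates, NameFilter filter) {
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            if (filter.accept(candidate)) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```

A constructor argument works here because the filter instance is created by the caller and handed over directly, as it would be when passing a PathFilter instance to globStatus.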

Here is an example from Tom White's excellent book: a PathFilter that excludes paths matching a given regular expression:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class RegexExcludePathFilter implements PathFilter {
    private final String regex;

    public RegexExcludePathFilter(String regex) {
        this.regex = regex;
    }

    @Override
    public boolean accept(Path path) {
        // Accept only paths whose full string form does NOT match the regex.
        return !path.toString().matches(regex);
    }
}

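You can filter a job's input directly with a PathFilter implementation by calling FileInputFormat.setInputPathFilter(JobConf, RegexExcludePathFilter.class) when initializing the job.

EDIT: Because setInputPathFilter takes a class rather than an instance, you cannot pass constructor arguments directly, but you should be able to get the same effect through the Configuration. If your filter also extends Configured, it receives the Configuration object that you populated earlier with the desired values, so you can read those values inside the filter and use them in accept.

For example, if you initialize the configuration like this: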

conf.set("date", "2013-01-15");

Then you can define your filter like this:

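import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class RegexIncludePathFilter extends Configured implements PathFilter {
    private String date;
    private FileSystem fs;

    @Override
    public boolean accept(Path path) {
        try {
            // Let directories through so that their contents get filtered too.
            if (fs != null && fs.isDirectory(path)) {
                return true;
            }
        } catch (IOException e) {
            // The directory check failed: fall through and treat the path as a file.
        }
        return date != null && path.toString().endsWith(date);
    }

    @Override
    public void setConf(Configuration conf) {
        super.setConf(conf); // keep getConf() consistent with what was injected
        // Configured's constructor calls setConf(null), hence the null check.
        if (null != conf) {
            this.date = conf.get("date");
            try {
                this.fs = FileSystem.get(conf);
            } catch (IOException e) {
                // No file system available: accept() then skips the directory check.
            }
        }
    }
}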

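EDIT 2: The original code had a few problems, so please see the updated class. You also need to remove the constructor, since it is no longer used, and check whether the path is a directory, in which case you should return true so that the contents of the directory can be filtered as well.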
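
The reason the constructor has to go is that the framework creates the filter from its class, using a no-argument constructor, and then hands it the configuration through setConf. The sketch below imitates that mechanism with JDK types only, as an illustration rather than Hadoop's actual code: MapConfigurable stands in for Hadoop's Configurable, a Map stands in for Configuration, and ConfiguredDateFilter and ReflectionSketch are hypothetical names.

```java
import java.util.Map;

/** Stand-in for Hadoop's Configurable, with a Map instead of Configuration (assumption). */
interface MapConfigurable {
    void setConf(Map<String, String> conf);
}

/** Hypothetical filter with a no-argument constructor; its settings arrive through setConf. */
final class ConfiguredDateFilter implements MapConfigurable {
    private String date;

    public ConfiguredDateFilter() {}

    @Override
    public void setConf(Map<String, String> conf) {
        this.date = conf.get("date");
    }

    /** Directories pass so their contents can be checked; files must end with the date. */
    boolean accept(String path, boolean isDirectory) {
        return isDirectory || (date != null && path.endsWith(date));
    }
}

final class ReflectionSketch {
    private ReflectionSketch() {}

    /** Creates an instance from its class, then injects the configuration if it is supported. */
    static <T> T newInstance(Class<T> type, Map<String, String> conf) {
        try {
            T instance = type.getDeclaredConstructor().newInstance();
            if (instance instanceof MapConfigurable) {
                ((MapConfigurable) instance).setConf(conf);
            }
            return instance;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + type.getName(), e);
        }
    }
}
```

With only a class to work from, there is nowhere to supply a constructor argument, which is why the date travels through the configuration instead.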