How to get the input file name in the mapper in a Hadoop program?

Question

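How can I get the name of the input file inside the mapper? I have multiple input files stored in the input directory, each mapper may read a different file, and I need to know which file a given mapper has read.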

Answer 1

First you need to get the input split. With the newer mapreduce API, that is done as follows:

context.getInputSplit();

But to get the file path and file name, you first need to cast the result to FileSplit.

So, to get the input file path, you can do the following:

Path filePath = ((FileSplit) context.getInputSplit()).getPath();
String filePathString = ((FileSplit) context.getInputSplit()).getPath().toString();

Similarly, to get the file name, call getName() on the path, like this:

String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();

Make sure you import the classes from the right package (mapred vs. mapreduce). (2 upvotes)

Answer 2

Tar*_*riq 14

Use this inside your mapper:

FileSplit fileSplit = (FileSplit)context.getInputSplit();
String filename = fileSplit.getPath().getName();

Edit:

If you want to do this inside configure() through the old API, try the following:

private String fileName;

public void configure(JobConf job)
{
   // "map.input.file" holds the path of the file this map task is reading
   fileName = job.get("map.input.file");
}

Answer 3

YaO*_*OzI 11

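If you are using Hadoop Streaming, you can use the JobConf variables in a streaming job's mapper/reducer.

As for the mapper's input file name, see the Configured Parameters section: the map.input.file variable (the name of the file the map is reading from) is the one that does the job. But note that: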

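Note: During the execution of a streaming job, the names of the "mapred" parameters are transformed. The dots (.) become underscores (_). For example, mapred.job.id becomes mapred_job_id and mapred.jar becomes mapred_jar. To get the values in a streaming job's mapper/reducer, use the parameter names with the underscores.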
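The renaming rule in the note can be sketched as a tiny helper. This is only an illustration of the dot-to-underscore rule quoted above; the function name streaming_env_name is hypothetical and not part of Hadoop.

```python
def streaming_env_name(param_name):
    """Return the environment-variable form of a job parameter name.

    Assumption: only the rule quoted in the note is applied,
    i.e. every dot (.) becomes an underscore (_).
    """
    return param_name.replace(".", "_")


if __name__ == "__main__":
    # Show the conversion for the names mentioned in this thread.
    for name in ("mapred.job.id", "mapred.jar", "map.input.file"):
        print(name, "->", streaming_env_name(name))
```

Under that rule, map.input.file is looked up in the mapper's environment as map_input_file.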

For example, if you are using Python, you can put these lines in your mapper file:

import os
file_name = os.getenv('map_input_file')
print(file_name)

This worked locally, but on EMR with YARN I needed to use the suggestion from http://stackoverflow.com/questions/20915569/how-can-to-get-the-filename-from-a-streaming-mapreduce-job-in-r. Specifically: `os.getenv('mapreduce_map_input_file')` (3 upvotes)
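Since the variable name apparently differs between setups (map_input_file in the answer, mapreduce_map_input_file in the comment), a streaming mapper could simply check both. The following is a minimal sketch under that assumption; the helper names input_file_name and tag_lines are made up for illustration, and the only variable names used are the two mentioned in this thread.

```python
import os

# Candidate environment variables, both taken from this thread:
# 'mapreduce_map_input_file' (reported to work on EMR with YARN) is tried
# first, then 'map_input_file' (the name used in the answer).
CANDIDATE_VARS = ("mapreduce_map_input_file", "map_input_file")


def input_file_name(environ=None):
    """Return the value of the first candidate variable that is set, else None."""
    if environ is None:
        environ = os.environ
    for var in CANDIDATE_VARS:
        value = environ.get(var)
        if value:
            return value
    return None


def tag_lines(lines, environ=None):
    """Yield each input line prefixed with the input file name and a tab."""
    file_name = input_file_name(environ) or "unknown"
    for line in lines:
        yield "%s\t%s" % (file_name, line)
```

In an actual mapper script you would feed it standard input, for example `for out in tag_lines(sys.stdin): sys.stdout.write(out)` after `import sys`. Passing the environment as a parameter is just a design choice that makes the lookup easy to try out locally with a plain dict.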

Answer 4

小智 5

If you are using the regular InputFormat, use this in your Mapper:

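InputSplit is = context.getInputSplit();
// Look up getInputSplit() reflectively; this assumes the split is a wrapper
// (for example the TaggedInputSplit used with MultipleInputs) that exposes it.
Method method = is.getClass().getMethod("getInputSplit");
method.setAccessible(true);
FileSplit fileSplit = (FileSplit) method.invoke(is);
String currentFileName = fileSplit.getPath().getName();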

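If you are using CombineFileInputFormat, the approach is different, because it combines several small files into one relatively big file (depending on your configuration). The Mapper and the RecordReader both run in the same JVM, so you can pass data between them at run time. You need to implement your own CombineFileRecordReaderWrapper and do the following: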

public class MyCombineFileRecordReaderWrapper<K, V> extends RecordReader<K, V>{
...
private static String mCurrentFilePath;
...
public void initialize(InputSplit combineSplit , TaskAttemptContext context) throws IOException, InterruptedException {
        assert this.fileSplitIsValid(context);
        mCurrentFilePath = mFileSplit.getPath().toString();
        this.mDelegate.initialize(this.mFileSplit, context);
    }
...
public static String getCurrentFilePath() {
        return mCurrentFilePath;
    }
...

Then, in your Mapper, use this:

String currentFileName = MyCombineFileRecordReaderWrapper.getCurrentFilePath();

Hope I helped :-)

Archived:	12 years, 2 months ago
Views:	42,602
Last activity:	7 years, 4 months ago