How to send the output of one mapper-reducer directly to another mapper-reducer without saving it to HDFS

Question

Problem solved: my final solution is at the bottom.

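Recently I was trying to run the recommender example from chapter 6 (listings 6.1 to 6.4) of Mahout in Action. I ran into a problem, and although I have googled it, I could not find a solution.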

Here is the problem: I have a mapper-reducer pair:

public final class WikipediaToItemPrefsMapper extends
    Mapper<LongWritable, Text, VarLongWritable, VarLongWritable> {

private static final Pattern NUMBERS = Pattern.compile("(\\d+)");

@Override
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String line = value.toString();
    Matcher m = NUMBERS.matcher(line);
    m.find();
    VarLongWritable userID = new VarLongWritable(Long.parseLong(m.group()));
    VarLongWritable itemID = new VarLongWritable();
    while (m.find()) {
        itemID.set(Long.parseLong(m.group()));
        context.write(userID, itemID);
    }
}
}

public class WikipediaToUserVectorReducer
    extends
    Reducer<VarLongWritable, VarLongWritable, VarLongWritable, VectorWritable> {

public void reduce(VarLongWritable userID,
        Iterable<VarLongWritable> itemPrefs, Context context)
        throws IOException, InterruptedException {
    Vector userVector = new RandomAccessSparseVector(Integer.MAX_VALUE, 100);
    for (VarLongWritable itemPref : itemPrefs) {
        userVector.set((int) itemPref.get(), 1.0f);
    }
    context.write(userID, new VectorWritable(userVector));
}
}

The reducer outputs a userID and a userVector, which look like this: 98955 {590:1.0 22:1.0 9059:1.0 3:1.0 2:1.0 1:1.0}
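To make the first job's behavior concrete, here is a minimal Hadoop-free sketch of the same logic in plain Java. The class name `FirstJobSketch`, its method names, and the sample input line are assumptions made for illustration; a `TreeMap` stands in for Mahout's sparse vector, so this is not how the real job stores its data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hadoop-free stand-in for the first job; all names here are hypothetical. */
final class FirstJobSketch {
    private static final Pattern NUMBERS = Pattern.compile("(\\d+)");

    /** Mirrors the mapper: the first number is the user ID, the rest are item IDs. */
    static List<long[]> map(String line) {
        List<long[]> pairs = new ArrayList<>();
        Matcher m = NUMBERS.matcher(line);
        if (!m.find()) {
            return pairs; // no user ID on this line, so emit nothing
        }
        long userID = Long.parseLong(m.group());
        while (m.find()) {
            pairs.add(new long[] {userID, Long.parseLong(m.group())});
        }
        return pairs;
    }

    /** Mirrors the reducer: one sparse vector (item index -> 1.0) per user. */
    static Map<Long, Map<Integer, Double>> reduce(List<long[]> pairs) {
        Map<Long, Map<Integer, Double>> userVectors = new TreeMap<>();
        for (long[] pair : pairs) {
            userVectors.computeIfAbsent(pair[0], k -> new TreeMap<>())
                       .put((int) pair[1], 1.0);
        }
        return userVectors;
    }

    public static void main(String[] args) {
        // Assumed input line format: "userID: itemID itemID ..."
        System.out.println(reduce(map("98955: 590 22 9059 3 2 1")));
        // prints {98955={1=1.0, 2=1.0, 3=1.0, 22=1.0, 590=1.0, 9059=1.0}}
    }
}
```

Unlike the original mapper, this sketch checks the result of the first `find()` call, so a line without any number is skipped instead of throwing.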

Then I want to use another mapper-reducer pair to process this data:

public class UserVectorSplitterMapper
    extends
    Mapper<VarLongWritable, VectorWritable, IntWritable, VectorOrPrefWritable> {

public void map(VarLongWritable key, VectorWritable value, Context context)
        throws IOException, InterruptedException {
    long userID = key.get();
    Vector userVector = value.get();
    Iterator<Vector.Element> it = userVector.iterateNonZero();
    IntWritable itemIndexWritable = new IntWritable();
    while (it.hasNext()) {
        Vector.Element e = it.next();
        int itemIndex = e.index();
        float preferenceValue = (float) e.get();
        itemIndexWritable.set(itemIndex);
        context.write(itemIndexWritable, 
                new VectorOrPrefWritable(userID, preferenceValue));
    }
}
}

When I try to run this job, it throws an error saying:

org.apache.hadoop.io.Text cannot be cast to org.apache.mahout.math.VectorWritable

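The first mapper-reducer writes its output to HDFS, and the second mapper-reducer tries to read that output. The mapper can cast 98955 to VarLongWritable, but it cannot convert {590:1.0 22:1.0 9059:1.0 3:1.0 2:1.0 1:1.0} to VectorWritable. So I wonder whether there is a way to have the first mapper-reducer send its output directly to the second pair, so that no data conversion is needed. I have looked through Hadoop in Action and Hadoop: The Definitive Guide, and there seems to be no way to do this. Any suggestions?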
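The failure can be reproduced in miniature without Hadoop. The sketch below is only an analogy: a text output format writes `key.toString()`, a tab, and `value.toString()`, and a text input format hands the line back as a string, so the value is no longer a vector object. The class `TextRoundTripSketch` and its methods are hypothetical, and a `Map` stands in for the Mahout vector.

```java
import java.util.Map;
import java.util.TreeMap;

/** Analogy for the Text -> VectorWritable cast failure; all names are hypothetical. */
final class TextRoundTripSketch {
    /** Like a text output format: key.toString(), a tab, then value.toString(). */
    static String writeAsText(long userID, Map<Integer, Double> vector) {
        return userID + "\t" + vector;
    }

    /** Like a text input format: the record value arrives as a plain string. */
    static Object readValue(String line) {
        return line.substring(line.indexOf('\t') + 1);
    }

    /** Returns true when the value that was read back cannot be used as a vector. */
    @SuppressWarnings("unchecked")
    static boolean castFails(Object value) {
        try {
            Map<Integer, Double> vector = (Map<Integer, Double>) value;
            return vector == null;
        } catch (ClassCastException e) {
            return true; // same failure mode as casting Text to VectorWritable
        }
    }

    public static void main(String[] args) {
        Map<Integer, Double> vector = new TreeMap<>();
        vector.put(590, 1.0);
        vector.put(22, 1.0);
        String line = writeAsText(98955L, vector);
        System.out.println(castFails(readValue(line))); // prints true
    }
}
```

The text on disk still looks like a vector to a human reader, but the type information is gone, which is why the second mapper cannot simply declare VectorWritable as its input value type.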

Problem solved

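Solution: By using SequenceFileOutputFormat, we can write the reducer output of the first MapReduce workflow to the DFS. The second MapReduce workflow can then read that temporary file as its input by passing SequenceFileInputFormat as the input format class when the mapper is set up. Because the vector is saved in a binary sequence file with a specific format, SequenceFileInputFormat can read it and convert it back into a vector.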
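Why a binary format fixes the problem can be shown with a small plain-Java analogy. A sequence file records the key and value class names in its header and stores each record as serialized bytes, so the reader can rebuild typed objects. The sketch below imitates that idea with `DataOutputStream` and `DataInputStream`. It is not the real SequenceFile layout, and `BinaryRoundTripSketch`, its methods, and the `SparseVector` type tag are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

/** Simplified analogy for a typed binary round trip; all names are hypothetical. */
final class BinaryRoundTripSketch {
    /** Stand-in for the value class name that a sequence file keeps in its header. */
    static final String VALUE_TYPE = "SparseVector";

    static byte[] write(long userID, Map<Integer, Double> vector) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(VALUE_TYPE);    // header: which type the values have
            out.writeLong(userID);       // key
            out.writeInt(vector.size()); // value: entry count, then (index, weight) pairs
            for (Map.Entry<Integer, Double> e : vector.entrySet()) {
                out.writeInt(e.getKey());
                out.writeDouble(e.getValue());
            }
        }
        return bytes.toByteArray();
    }

    static Map<Integer, Double> readVector(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String type = in.readUTF();
            if (!VALUE_TYPE.equals(type)) {
                throw new IOException("unexpected value type: " + type);
            }
            in.readLong(); // skip the key
            int count = in.readInt();
            Map<Integer, Double> vector = new TreeMap<>();
            for (int i = 0; i < count; i++) {
                int index = in.readInt();
                vector.put(index, in.readDouble());
            }
            return vector;
        }
    }

    public static void main(String[] args) throws IOException {
        Map<Integer, Double> vector = new TreeMap<>();
        vector.put(590, 1.0);
        vector.put(22, 1.0);
        System.out.println(readVector(write(98955L, vector)).equals(vector)); // prints true
    }
}
```

The data still goes through storage between the two jobs, as in the solution above. What changes is that the second job receives typed values instead of text that has to be parsed.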

Here is some sample code:

confFactory ToItemPrefsWorkFlow = new confFactory(
        new Path("/dbout"),               // input file path
        new Path("/mahout/output.txt"),   // output file path
        TextInputFormat.class,            // input format
        VarLongWritable.class,            // mapper key format
        Item_Score_Writable.class,        // mapper value format
        VarLongWritable.class,            // reducer key format
        VectorWritable.class,             // reducer value format
        SequenceFileOutputFormat.class    // reducer output format
);
ToItemPrefsWorkFlow.setMapper(WikipediaToItemPrefsMapper.class);
ToItemPrefsWorkFlow.setReducer(WikipediaToUserVectorReducer.class);
JobConf conf1 = ToItemPrefsWorkFlow.getConf();

confFactory UserVectorToCooccurrenceWorkFlow = new confFactory(
        new Path("/mahout/output.txt"),
        new Path("/mahout/UserVectorToCooccurrence"),
        SequenceFileInputFormat.class,    // note: the second workflow's mapper now reads its input with SequenceFileInputFormat
        //UserVectorToCooccurrenceMapper.class,
        IntWritable.class,
        IntWritable.class,
        IntWritable.class,
        VectorWritable.class,
        SequenceFileOutputFormat.class
);
UserVectorToCooccurrenceWorkFlow.setMapper(UserVectorToCooccurrenceMapper.class);
UserVectorToCooccurrenceWorkFlow.setReducer(UserVectorToCooccurrenceReducer.class);
JobConf conf2 = UserVectorToCooccurrenceWorkFlow.getConf();

JobClient.runJob(conf1);
JobClient.runJob(conf2);

If you have any questions, feel free to contact me.

Answer 1

Chr*_*ite 4

You need to explicitly configure the first job's output to use SequenceFileOutputFormat and define the output key and value classes:

job.setOutputFormat(SequenceFileOutputFormat.class);
job.setOutputKeyClass(VarLongWritable.class);
job.setOutputValueClass(VectorWritable.class);

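Without seeing your driver code, my guess is that you are using TextOutputFormat for the first job's output and TextInputFormat for the second job's input. That input format delivers <LongWritable, Text> pairs (byte offset and line) to the second mapper, so the value arrives as Text rather than VectorWritable.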

Archived:	13 years, 6 months ago
Views:	9333
Last activity:	9 years, 10 months ago