Hadoop cache file for all map tasks

Question

joj*_*oba 4 java file-io hadoop mapreduce distributed-cache

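My map function has to read a file for every input record. The file never changes; it is read-only. I think the distributed cache could help me a lot, but I can't find a way to use it. The public void configure(JobConf conf) function that I would need to override is, I believe, deprecated. JobConf is certainly deprecated. All the DistributedCache tutorials use the deprecated approach. What can I do? Is there another configure function that I can override?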

These are the first lines of my map function:

     Configuration conf = new Configuration();          //load the MFile
     FileSystem fs = FileSystem.get(conf);
     Path inFile = new Path("planet/MFile");       
     FSDataInputStream in = fs.open(inFile);
     DecisionTree dtree=new DecisionTree().loadTree(in);

I want to cache that MFile so that my map function does not have to read it over and over again.

Answer 1

joj*_*oba 5

I think I got it working. I followed Ravi Bhatt's hint and wrote this:

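  // Called once per map task, before any map() call, so the tree is loaded once per task.
  @Override
  protected void setup(Context context) throws IOException, InterruptedException
  {
      FileSystem fs = FileSystem.get(context.getConfiguration());
      // URIs registered in the driver via DistributedCache.addCacheFile
      URI files[]=DistributedCache.getCacheFiles(context.getConfiguration());
      Path path = new Path(files[0].toString());
      in = fs.open(path);                      // 'in' and 'dtree' are fields of the mapper class
      dtree=new DecisionTree().loadTree(in);
  }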

In my main method I do this to add the file to the cache:

  DistributedCache.addCacheFile(new URI(args[0]+"/"+"MFile"), conf);
  Job job = new Job(conf, "MR phase one");

I can retrieve the file I need this way, but I can't tell whether it is working 100% correctly. Is there a way to test it? Thanks.
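
One possible way to check the "load once per task" behaviour is to count how often the file is loaded and compare that with the number of records processed. The sketch below is a Hadoop-free simulation of the mapper lifecycle, not real Hadoop code: LoadOnceDemo, CountingMapper, loadModel and runTask are hypothetical names, and the in-memory list only stands in for the DecisionTree built from MFile.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LoadOnceDemo {

    /** Hypothetical stand-in for a mapper; this is not a Hadoop class. */
    static class CountingMapper {
        int loadCount = 0;            // how many times the model "file" was loaded
        int mapCalls = 0;             // how many records were processed
        private List<String> model;   // stand-in for the tree built from MFile

        // Mirrors Mapper.setup(): invoked once per task, before any record.
        void setup() {
            model = loadModel();
        }

        // Mirrors Mapper.map(): invoked once per input record, reuses the model.
        String map(String record) {
            mapCalls++;
            return record + ":" + model.size();
        }

        // Stand-in for opening MFile and calling loadTree(in).
        private List<String> loadModel() {
            loadCount++;
            return new ArrayList<>(Arrays.asList("node1", "node2"));
        }
    }

    // Drives one simulated task: setup once, then map for every record.
    static CountingMapper runTask(List<String> records) {
        CountingMapper mapper = new CountingMapper();
        mapper.setup();
        for (String r : records) {
            mapper.map(r);
        }
        return mapper;
    }

    public static void main(String[] args) {
        CountingMapper m = runTask(Arrays.asList("a", "b", "c"));
        // prints "loads=1, records=3"
        System.out.println("loads=" + m.loadCount + ", records=" + m.mapCalls);
    }
}
```

In a real job the same idea could probably be applied with a job counter, for example by calling context.getCounter("debug", "treeLoads").increment(1) inside setup() (the group and counter names here are made up). After the job finishes, that counter should then equal the number of map tasks rather than the number of input records.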
