Creating and placing files in the Hadoop distributed cache

Question

whi*_*fin 1 java caching hadoop mapreduce

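I'm trying to dynamically create a resource and place it in the Hadoop distributed cache before my job runs. In other words, this will be an automated job that needs to gather several things together (over HTTP) before the mappers execute.

The problem I'm facing is that, because of the number of mappers I'm running, I can't do this in the setup phase: it would put too much load on the server being called. I'd like to be able to retrieve my resources, write them to a file, and then put that file in the distributed cache for later access.

Big note: I don't want to write the file to Hadoop; I would rather keep it local to the node.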

    // The whitelist cache file
    File resourceFile = new File("resources.json");

    // Create an output stream
    FileOutputStream outputStream = new FileOutputStream(resourceFile.getAbsoluteFile());

    // Write the whitelist to the local file
    // (this is using Jackson JSON, FYI)
    mapper.writeValue(outputStream, myResources);

    // Add the file to the job
    job.addCacheFile(new URI("file://" + resourceFile.getAbsolutePath()));


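This runs in my job's run() method, i.e. before the mappers start, but whenever I try to access new File("resources.json") in a mapper, it gives me a FileNotFoundException.

What is the correct way to create these temporary files, and what is the best way to access them from within the job?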

Answer 1

小智 5

Try this to put it into the distributed cache:

    _job.addCacheFile(new URI(filePath + "#" + filename));


where filename is the name the file will have in the distributed cache.
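To make the role of the "#" part concrete, here is a minimal sketch using only java.net.URI, with no Hadoop dependency. The class CacheUriSketch, the helper withLinkName, and the path file:///tmp/job/resources.json are hypothetical names chosen for illustration; they are not from the answer. URI.create is used in place of new URI(...) only to avoid the checked exception. The sketch shows how such a string splits into a path (where the file currently is) and a fragment (the name after "#"), which is the name the answer says the file will have in the distributed cache.

```java
import java.net.URI;

public class CacheUriSketch {

    // Hypothetical helper: builds the same kind of URI as the answer,
    // i.e. the file location followed by "#" and the desired cache name.
    static URI withLinkName(String filePath, String linkName) {
        return URI.create(filePath + "#" + linkName);
    }

    public static void main(String[] args) {
        // Assumed example location; substitute wherever the file was written.
        URI uri = withLinkName("file:///tmp/job/resources.json", "resources.json");

        // The path part identifies the source file.
        System.out.println(uri.getPath());     // /tmp/job/resources.json

        // The fragment is the part after '#', used here as the cache name.
        System.out.println(uri.getFragment()); // resources.json
    }
}
```

The resulting URI would then be passed to job.addCacheFile(...) as in the line above, and the mapper would open the file by the fragment name rather than by its original absolute path.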

Reading the file in the Mapper looks like this:

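    // Refer to the file by the name given after '#' when it was added to the cache
    Path path = new Path(filename);
    FileSystem fs = FileSystem.getLocal(context.getConfiguration());

    // Open the cached file for line-by-line reading
    BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(path)));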

