Passing directories rather than files to hadoop-streaming?

Jon*_*ser 7 hadoop hadoop-streaming

In my job I need to parse many sets of historical logs. Individual customers (there are thousands) may have hundreds of log subdirectories, broken out by date. For example:

  • logs/Customer_One/2011-01-02-001
  • logs/Customer_One/2012-02-03-001
  • logs/Customer_One/2012-02-03-002
  • logs/Customer_Two/2009-03-03-001
  • logs/Customer_Two/2009-03-03-002

Each individual log set may itself be five or six levels deep and contain thousands of files.

Therefore, I actually want the individual map jobs to handle the subdirectories: simply enumerating the individual files is part of my distributed-computing problem!
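To make that intent concrete, a mapper along these lines might look like the following. This is only an illustrative sketch (the actual mapper.sh is not shown in the question, and `walk_dirs` is a hypothetical name): each input record is a directory path, and the map task itself does the enumeration.

```shell
# Hypothetical mapper sketch: each input record is a directory path;
# the map task enumerates the files and emits their contents itself,
# so the listing work happens inside the mapper, not up front.
walk_dirs() {
  while IFS= read -r dir; do
    find "$dir" -type f -exec cat {} +
  done
}
```

In a real job, mapper.sh would read directory paths on stdin like this and pipe the file contents into the actual parsing logic.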

Unfortunately, when I try passing a directory that contains only log subdirectories to Hadoop, it complains that it cannot pass those subdirectories to my mapper. (Again, I have written my mapper to accept subdirectories as input):

$ hadoop jar "${HADOOP_HOME}/contrib/streaming/hadoop-streaming-${HADOOP_VERSION}.jar" -input file:///mnt/logs/Customer_Name/ -file mapper.sh -mapper "mapper.sh" -file reducer.sh -reducer "reducer.sh" -output .

[ . . . ]

12/04/10 12:48:35 ERROR security.UserGroupInformation: PriviledgedActionException as:cloudera (auth:SIMPLE) cause:java.io.IOException: Not a file: file:/mnt/logs/Customer_Name/2011-05-20-003
12/04/10 12:48:35 ERROR streaming.StreamJob: Error Launching job : Not a file: file:/mnt/logs/Customer_Name/2011-05-20-003
Streaming Command Failed!
[cloudera@localhost ~]$


Is there a straightforward way to convince hadoop-streaming to let me assign directories as work items?

Chr*_*ite 2

I think you need to look into writing a custom InputFormat. You could pass it the root directory; it would create one split per customer, and each split's record reader would then do the directory walk and push the file contents to your mapper.
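Short of a full Java InputFormat, the split-per-customer idea can be approximated from the streaming side by generating a small manifest with one customer directory per line and using that manifest as the job input. This is a sketch under assumptions not stated in the answer (`build_manifest` is a hypothetical helper, and the mapper is assumed to walk whatever directory path it receives):

```shell
# Hypothetical driver-side helper: list one customer directory per
# line, producing a small manifest file that becomes the job input.
build_manifest() {
  find "$1" -mindepth 1 -maxdepth 1 -type d | sort
}
```

Each line of the manifest then stands in for one "split": a mapper that receives a line gets a single customer directory to traverse, which distributes the enumeration work the way the custom InputFormat would.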