How to read records that span multiple lines, and how to handle records broken across input splits

gho*_*sts 5 hadoop mapreduce input-split

I have a log file that looks like this:

Begin ... 12-07-2008 02:00:05         ----> record1
incidentID: inc001
description: blah blah blah 
owner: abc 
status: resolved 
end .... 13-07-2008 02:00:05 
Begin ... 12-07-2008 03:00:05         ----> record2 
incidentID: inc002 
description: blah blah blahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblah
owner: abc 
status: resolved 
end .... 13-07-2008 03:00:05

I want to process this with MapReduce and extract the incident ID, the status, and the time taken by each incident.

How can I handle both records, given that they have variable lengths, and what happens when an input split falls before a record ends?
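For a single complete record, the field extraction itself is straightforward. A plain-Java sketch (outside Hadoop; the class and method names here are illustrative, and the timestamp format is taken from the sample log above):

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class IncidentParser {
    // Parses one multi-line record and returns "incidentID status elapsedHours".
    // Field labels and the dd-MM-yyyy HH:mm:ss timestamp format match the
    // sample log; adjust the patterns if the real log differs.
    public static String parse(String record) throws Exception {
        SimpleDateFormat fmt = new SimpleDateFormat("dd-MM-yyyy HH:mm:ss");
        String id = null, status = null;
        Date begin = null, end = null;
        for (String raw : record.split("\n")) {
            String line = raw.trim();
            if (line.startsWith("Begin")) {
                begin = fmt.parse(line.replaceAll("^Begin\\s*\\.+\\s*", ""));
            } else if (line.startsWith("end")) {
                end = fmt.parse(line.replaceAll("^end\\s*\\.+\\s*", ""));
            } else if (line.startsWith("incidentID:")) {
                id = line.substring("incidentID:".length()).trim();
            } else if (line.startsWith("status:")) {
                status = line.substring("status:".length()).trim();
            }
        }
        long hours = (end.getTime() - begin.getTime()) / (1000L * 60 * 60);
        return id + " " + status + " " + hours + "h";
    }

    public static void main(String[] args) throws Exception {
        String record = "Begin ... 12-07-2008 02:00:05\n"
                + "incidentID: inc001\n"
                + "description: blah blah blah\n"
                + "owner: abc\n"
                + "status: resolved\n"
                + "end .... 13-07-2008 02:00:05";
        System.out.println(parse(record));  // inc001 resolved 24h
    }
}
```

The hard part, as the answer below explains, is getting each multi-line record delivered to the mapper as one value in the first place.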

Chr*_*ite 5

You will need to write your own input format and record reader to ensure the file is split correctly around your record delimiters.

Basically, your record reader needs to seek to its split's starting byte offset, then scan forward (reading lines) until either:

  • it finds a `Begin ...` line
    • it then reads lines up to the next `end ...` line, and provides the lines between begin and end as the input for the next record
  • or it scans past the end of its split, or hits EOF
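The scan loop above can be sketched outside Hadoop with a plain `BufferedReader`. A real implementation would live in a `RecordReader` subclass and also track the split's start and end offsets; the class and method names below are illustrative, not Hadoop API:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class MultiLineScanner {
    // Scans forward from the current position and returns the lines from the
    // next "Begin" line through the following "end" line as one record, or
    // null at EOF. In a real RecordReader you would also stop once the scan
    // passes the end of the split, so each record is read by exactly one task.
    public static String nextRecord(BufferedReader in) throws IOException {
        List<String> lines = new ArrayList<>();
        boolean inRecord = false;
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith("Begin")) {      // start of a record
                inRecord = true;
                lines.clear();
            }
            if (inRecord) {
                lines.add(line);
                if (line.startsWith("end")) {    // record complete: emit it
                    return String.join("\n", lines);
                }
            }
            // Lines seen before the first "Begin" belong to a record that
            // started in the previous split, so they are skipped here.
        }
        return null;  // EOF without a complete record
    }
}
```

Note how the skip-until-`Begin` behaviour is exactly what makes splits safe: the reader whose split starts mid-record discards the partial tail, while the previous reader scans past its split end to finish that same record.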

This is similar to the algorithm Mahout's XmlInputFormat uses to handle multi-line XML as input; in fact, you may be able to adapt that source code directly to your case.

As mentioned in @irW's answer, NLineInputFormat is another option if your records each have a fixed number of lines, but it is very inefficient for larger files, as it has to open and read the entire file to discover the line offsets in the input format's getSplits() method.
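For completeness, the NLineInputFormat route is just job configuration (shown here against the new `org.apache.hadoop.mapreduce` API); it only applies if every record really has a fixed line count, which the wrapped `description` field in the sample breaks:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

// Assumes exactly 6 lines per record, which does NOT hold for this log.
Job job = Job.getInstance(new Configuration(), "incidents");
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, 6);  // one record's worth of lines per split
```

Note that even then the mapper still receives the lines one at a time; the setting only controls how many lines each map task handles, so the mapper must still accumulate lines into a record itself.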