gho*_*sts asked (tags: hadoop, mapreduce, input-split)
I have a log file that looks like this:
Begin ... 12-07-2008 02:00:05 ----> record1
incidentID: inc001
description: blah blah blah
owner: abc
status: resolved
end .... 13-07-2008 02:00:05
Begin ... 12-07-2008 03:00:05 ----> record2
incidentID: inc002
description: blah blah blahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblah
owner: abc
status: resolved
end .... 13-07-2008 03:00:05
I want to process this with MapReduce, extracting the incident ID, the status, and the time taken by each incident.
How do I handle records like these, given that they have variable lengths and an input split may fall in the middle of a record?
You'll need to write your own input format and record reader to ensure the file is split correctly around your record delimiters.
Basically, your record reader needs to seek to its split's byte offset, then scan forward (reading lines) until it finds:
a `Begin ...` line,
then read lines up to and including the next `end ...` line, providing the lines between Begin and end as the input for the next record. This is similar in algorithm to how Mahout's XmlInputFormat handles multi-line XML as input; in fact, you may be able to modify that source directly to handle your situation.
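The scan-forward logic described above can be sketched in plain Java, with the Hadoop plumbing (split offsets, `RecordReader` subclassing) omitted. `RecordScanner` and `nextRecord` are hypothetical names used for illustration, not part of any Hadoop API:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class RecordScanner {

    // Skips lines until a "Begin ..." line is found, then collects lines
    // up to and including the next "end ..." line. Returns null when no
    // further complete record exists (e.g. the split ends mid-record,
    // in which case the next split's reader picks the record up instead).
    public static List<String> nextRecord(BufferedReader reader) throws IOException {
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.startsWith("Begin")) {
                List<String> record = new ArrayList<>();
                record.add(line);
                while ((line = reader.readLine()) != null) {
                    record.add(line);
                    if (line.startsWith("end")) {
                        return record;
                    }
                }
                return null; // truncated record: no matching "end" line
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        String log =
            "Begin ... 12-07-2008 02:00:05 ----> record1\n" +
            "incidentID: inc001\n" +
            "status: resolved\n" +
            "end .... 13-07-2008 02:00:05\n";
        BufferedReader r = new BufferedReader(new StringReader(log));
        List<String> rec = nextRecord(r);
        System.out.println(rec.size());   // 4
        System.out.println(rec.get(1));   // incidentID: inc001
    }
}
```

In a real `RecordReader`, the reader would first seek to the split's start offset; skipping forward to the next `Begin` line is what guarantees that a record straddling a split boundary is read exactly once, by the reader whose split contains its `Begin` line.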
As mentioned in @irW's answer, NLineInputFormat is another option if your records have a fixed number of lines each, but it is very inefficient for larger files, since it has to open and read the whole file to discover the line offsets in the input format's getSplits() method.
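For example, if every record in your log were exactly 6 lines (Begin, incidentID, description, owner, status, end), NLineInputFormat could be configured through its lines-per-split property (shown here as a config fragment; `myjob.jar` and `MyDriver` are placeholder names):

```
hadoop jar myjob.jar MyDriver \
  -D mapreduce.input.lineinputformat.linespermap=6 \
  input/ output/
```

But given that your description field is variable length, this fixed-line assumption does not hold for your data, which is why the custom record reader is the safer approach.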