Posts by gho*_*sts

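How to read records that span multiple lines, and how to handle broken records at input split boundaries

I have a log file like the following: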

Begin ... 12-07-2008 02:00:05         ----> record1
incidentID: inc001
description: blah blah blah 
owner: abc 
status: resolved 
end .... 13-07-2008 02:00:05 
Begin ... 12-07-2008 03:00:05         ----> record2 
incidentID: inc002 
description: blah blah blahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblah
owner: abc 
status: resolved 
end .... 13-07-2008 03:00:05

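I want to process this with MapReduce. For each incident I want to extract the incident ID, the status, and the time the incident took (the difference between the Begin and end timestamps).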
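To make the goal concrete, here is a minimal sketch in plain Python (not Hadoop code) of the extraction I have in mind. The function names `parse_timestamp` and `parse_records` are made up for illustration, and the sketch assumes the timestamps are day-month-year and always appear as the third and fourth whitespace-separated tokens on the Begin and end lines, as in the sample above.

```python
from datetime import datetime

# Assumption: dates are day-month-year, as the sample (13-07-2008) suggests.
TIMESTAMP_FORMAT = "%d-%m-%Y %H:%M:%S"


def parse_timestamp(line):
    # Assumes the layout "<marker> <dots> <date> <time> [anything else]".
    tokens = line.split()
    return datetime.strptime(" ".join(tokens[2:4]), TIMESTAMP_FORMAT)


def parse_records(lines):
    """Yield (incident_id, status, duration) for each complete record."""
    current = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("Begin "):
            # A new Begin silently drops any record that never saw its end line.
            current = {"begin": parse_timestamp(line)}
        elif line.startswith("end ") and current is not None:
            incident_id = current.get("incidentID")
            status = current.get("status")
            if incident_id and status:
                duration = parse_timestamp(line) - current["begin"]
                yield incident_id, status, duration
            current = None
        elif current is not None and ":" in line:
            # Field lines look like "key: value"; split on the first colon only.
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()


sample_lines = [
    "Begin ... 12-07-2008 02:00:05         ----> record1",
    "incidentID: inc001",
    "description: blah blah blah",
    "owner: abc",
    "status: resolved",
    "end .... 13-07-2008 02:00:05",
]
for incident_id, status, duration in parse_records(sample_lines):
    print(incident_id, status, duration)
```

For the first sample record this gives inc001, resolved, and a duration of one day. Because fields are collected line by line, the length of the description does not matter.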

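How should I handle these records, given that they have variable lengths, and what happens when an input split boundary falls before the end of a record?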
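One way to think about the split problem is the convention Hadoop's line-based reader follows for single lines: a reader skips the partial unit at the start of its split and reads past the end of its split to finish the last unit it started. Below is a minimal simulation of that idea for Begin/end records, in plain Python over an in-memory byte string. The function `read_split` and its ownership rule (a record belongs to the split in which its Begin line starts) are my own assumptions for illustration. In a real job this logic would presumably live in a custom `InputFormat` and `RecordReader`.

```python
import io


def read_split(data, start, end):
    """Return the records (lists of lines) owned by the byte range [start, end).

    Assumed rule: a record is owned by the split in which its Begin line starts.
    """
    stream = io.BytesIO(data)
    if start > 0:
        # Back up one byte and discard through the next newline, so a line
        # starting exactly at `start` is kept and a partial line is skipped.
        stream.seek(start - 1)
        stream.readline()
    records, current = [], None
    while True:
        pos = stream.tell()
        if current is None and pos >= end:
            break  # no open record and past our range: the next split takes over
        raw = stream.readline()
        if not raw:
            break  # end of file; an unfinished record is dropped as damaged
        line = raw.decode("utf-8").strip()
        if line.startswith("Begin "):
            if pos >= end:
                break  # this record starts in the next split
            current = [line]  # any earlier unfinished record is dropped
        elif current is not None:
            current.append(line)
            if line.startswith("end "):
                records.append(current)  # may have read past `end` to get here
                current = None
    return records


data = (
    b"Begin ... 12-07-2008 02:00:05\nincidentID: inc001\n"
    b"status: resolved\nend .... 13-07-2008 02:00:05\n"
    b"Begin ... 12-07-2008 03:00:05\nincidentID: inc002\n"
    b"status: resolved\nend .... 13-07-2008 03:00:05\n"
)
middle = len(data) // 2
print(len(read_split(data, 0, middle)), len(read_split(data, middle, len(data))))
```

With this rule every complete record is read exactly once no matter where the boundary falls, and a record with a missing end line is discarded when the next Begin line (or the end of the file) is reached. In this sketch, lines before the first Begin in a split are ignored because they belong to a record owned by the previous split.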

hadoop mapreduce input-split

Score: 5
Answers: 1
Views: 3521
