将事件归结为时间间隔

fex*_*fex 5 events logging mapreduce reducing

场景:我有一个记录此CSV示例中的事件的服务:

#TimeStamp, Name, ColorOfPullover
TimeStamp01, Peter, Green
TimeStamp02, Bob, Blue
TimeStamp03, Peter, Green
TimeStamp04, Peter, Red
TimeStamp05, Peter, Green
Run Code Online (Sandbox Code Playgroud)

例如彼得穿绿色的事件将经常连续发生.

我有两个目标:

  1. 保持数据尽可能小
  2. 保留所有相关数据

相关办法:我想知道,在这种时间跨度一个人穿什么颜色.例如:

#StartTime, EndTime, Name, ColorOfPullover
TimeStamp01, TimeStamp03, Peter, Green
TimeStamp02, TimeStamp02, Bob, Blue
TimeStamp03, TimeStamp03, Peter, Green
TimeStamp04, TimeStamp04, Peter, Red
TimeStamp05, TimeStamp05, Peter, Green
Run Code Online (Sandbox Code Playgroud)

在这种格式中,我可以回答以下问题:Peter在TimeStamp02时穿的是哪种颜色?(我可以放心地假设每个人在相同颜色的两个记录事件之间穿着相同的颜色.)

主要问题:我可以使用现有技术来实现这一目标吗?即我可以提供连续的事件流,并提取和存储相关数据?


确切地说,我需要实现这样的算法(伪代码).OnNewEvent为CSV示例的每一行调用该方法.其中参数event已包含来自行的数据作为成员变量.

def OnNewEvent(even)
    entry = Database.getLatestEntryFor(event.personName)
    if (entry.pulloverColor == event.pulloverColor)
        entry.setIntervalEndDate(event.date)
        Database.store(entry)
    else
        newEntry = new Entry
        newEntry.setIntervalStartDate(event.date)
        newEntry.setIntervalEndDate(event.date)
        newEntry.setPulloverColor(event.pulloverColor))
        newEntry.setName(event.personName)
        Database.createNewEntry(newEntry)
    end
end
Run Code Online (Sandbox Code Playgroud)

Kra*_*tam 0

This is typical scenario of any streaming architecture.  

There are multiple existing technologies which work in tandem  to get what you want. 


1.  NoSql Database (Hbase, Aerospike, Cassandra)
2.  streaming jobs Like Spark streaming(micro batch), Storm 
3.  Run mapreduce in micro batch to insert into NoSql Database.
4.  Kafka Distriuted queue

The end to end flow. 

Data -> streaming framework -> NoSql Database. 
OR 
Data -> Kafka -> streaming framework -> NoSql Database. 


IN NoSql database there are two ways to model your data. 
1. Key by "Name" and for every event for that given key, insert into Database.
   While fetching u get back all events corresponding to that key. 

2. Key by "name", every time a event for key is there, do a UPSERT into a existing blob(Object saved as binary), Inside the blob you maintain the time range and color seen.  

Code sample to read and write to Hbase and Aerospike  
Run Code Online (Sandbox Code Playgroud)

Hbase: http: //bytepadding.com/hbase/

Aerospike: http: //bytepadding.com/aerospike/