Designing a database schema for event-based analytics

Question

CCS*_*Sab 11 mysql sql database analytics database-design

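I'm trying to figure out the best way to model a schema for an event-based analytics system I'm writing. My main concern is designing it so that queries are simple and fast. I will also be using MySQL. I'll go over some of the requirements and outline a possible (but, I think, poor) schema.

Requirements

Track events (e.g. track occurrences of the "APP_LAUNCH" event)
Define custom events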
Ability to segment events on >1 custom properties (e.g. get occurrences of "APP_LAUNCH" segmented on the "APP_VERSION" property)
Track sessions
Perform queries based on timestamp range

Possible Modeling

The main problem that I'm having is how to model segmentation and the queries to perform to get the overall counts of an event.

My original idea was to define an EVENTS table with an id, int count, timestamp, property (?), and a foreign key to an EVENTTYPE. An EVENTTYPE has an id, name, and additional information belonging to a generic event type.

For example, the "APP_LAUNCH" event would have an entry in the EVENTS table with unique id, count representing the number of times the event happened, the timestamp (unsure about what this is stamped on), and a property or list of properties (e.g. "APP_VERSION", "COUNTRY", etc.) and a foreign key to an EVENTTYPE with name "APP_LAUNCH".

Comments and Questions

I'm pretty sure this isn't a good way to model this for the following reasons. It makes it difficult to do timestamp ranged queries ("Number of APP_LAUNCHES between time x and y"). The EVENTTYPE table doesn't really serve a purpose. Finally, I'm unsure as to how I would even perform queries for different segmentations. The last one is the one I'm most worried about.

I would appreciate any help in helping to correctly model this or in pointing me to resources that would help.

A final question (which is probably dumb): Is it bad to insert a row for every event? For example, say my client-side library makes the following call to my API:

track("APP_LAUNCH", {count: 4, segmentation: {"APP_VERSION": 1.0}})

How would I actually store this in the table (this is closely related to the schema design obviously)? Is it bad to simply insert a row for each one of these calls, of which there may be a significant number? My gut reaction is that I'm really interested mainly in the overall aggregated counts. I don't have enough experience with SQL to know how these queries perform over possibly hundreds of thousands of these entries. Would an aggregate table or an in-memory cache help to alleviate problems when I want the client to actually get the analytics?

I realize there are lots of questions here, but I would really appreciate any and all help. Thanks!

Answer 1

TMS*_*TMS 19

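I think most of your concerns are unnecessary. Taking your questions one at a time:

1) The biggest problem is the custom attributes, which differ for each event. For this, you have to use an EAV (entity-attribute-value) design. The important question is: what types can these attributes have? If there is more than one, e.g. string and integer, it gets more complicated. In general there are two kinds of design:

Use one table and one column for values of all types, and convert everything to string (not a scalable solution)
Have a separate table for each data type (very scalable, I'd go for this)

So, the tables would look like: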

Events             EventId int,  EventTypeId varchar,   TS timestamp
EventAttrValueInt  EventId int,  AttrName varchar,  Value int
EventAttrValueChar EventId int,  AttrName varchar,  Value varchar

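As a rough sketch of how the three tables above could be created and populated, here is a small runnable version. It uses Python's built-in sqlite3 with an in-memory database as a stand-in for MySQL; the column sizes, the primary keys and the `record_event` helper are assumptions for illustration, not part of the original answer.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; types and sizes are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Events (
    EventId     INTEGER PRIMARY KEY,
    EventTypeId VARCHAR(64) NOT NULL,
    TS          TEXT        NOT NULL
);
CREATE TABLE EventAttrValueInt (
    EventId  INTEGER     NOT NULL REFERENCES Events(EventId),
    AttrName VARCHAR(64) NOT NULL,
    Value    INTEGER     NOT NULL,
    PRIMARY KEY (EventId, AttrName)
);
CREATE TABLE EventAttrValueChar (
    EventId  INTEGER      NOT NULL REFERENCES Events(EventId),
    AttrName VARCHAR(64)  NOT NULL,
    Value    VARCHAR(255) NOT NULL,
    PRIMARY KEY (EventId, AttrName)
);
""")


def record_event(event_type, ts, attrs):
    """Insert one event row plus one attribute row per property.

    Hypothetical helper: integers go to EventAttrValueInt, everything
    else is stored as a string in EventAttrValueChar.
    """
    cur = conn.execute(
        "INSERT INTO Events (EventTypeId, TS) VALUES (?, ?)", (event_type, ts)
    )
    event_id = cur.lastrowid
    for name, value in attrs.items():
        if isinstance(value, int):
            table, stored = "EventAttrValueInt", value
        else:
            table, stored = "EventAttrValueChar", str(value)
        conn.execute(
            f"INSERT INTO {table} (EventId, AttrName, Value) VALUES (?, ?, ?)",
            (event_id, name, stored),
        )
    return event_id


launch_id = record_event(
    "APP_LAUNCH",
    "2013-11-05 10:00:00",
    {"APPVERSION": 5, "APP_NAME": "Office Suite"},
)
```

One event with two properties therefore costs three rows, which is the usual price of an EAV layout.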

2) What do you mean by segmentation? Querying on various parameters of the event? With the EAV design mentioned above, you can do this:

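select e.*
from Events e
  -- aliases are needed: both attribute tables have EventId, AttrName and Value
  join EventAttrValueInt  i on i.EventId = e.EventId and i.AttrName = 'APPVERSION'
                                                     and i.Value > 4
  join EventAttrValueChar c on c.EventId = e.EventId and c.AttrName = 'APP_NAME'
                                                     and c.Value like '%Office%'
where e.EventTypeId = 'APP_LAUNCH'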

This will select all events of type APP_LAUNCH where APPVERSION > 4 and APP_NAME contains "Office".
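If segmentation means "occurrences of an event broken down by one property", the same join can be combined with GROUP BY. The following is a minimal sketch under that assumption, again using Python's sqlite3 as a stand-in for MySQL; the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Events (EventId INTEGER PRIMARY KEY, EventTypeId TEXT, TS TEXT);
CREATE TABLE EventAttrValueInt (EventId INTEGER, AttrName TEXT, Value INTEGER);
""")

# Made-up sample data: (EventId, EventTypeId, APPVERSION)
sample = [
    (1, "APP_LAUNCH", 4),
    (2, "APP_LAUNCH", 4),
    (3, "APP_LAUNCH", 5),
    (4, "APP_LAUNCH", 4),
    (5, "APP_CLOSE", 5),
]
for event_id, event_type, version in sample:
    conn.execute(
        "INSERT INTO Events VALUES (?, ?, '2013-11-05 10:00:00')",
        (event_id, event_type),
    )
    conn.execute(
        "INSERT INTO EventAttrValueInt VALUES (?, 'APPVERSION', ?)",
        (event_id, version),
    )

# Count APP_LAUNCH occurrences per APPVERSION value.
segments = conn.execute("""
    SELECT a.Value AS app_version, COUNT(*) AS occurrences
    FROM Events e
    JOIN EventAttrValueInt a
      ON a.EventId = e.EventId AND a.AttrName = 'APPVERSION'
    WHERE e.EventTypeId = 'APP_LAUNCH'
    GROUP BY a.Value
    ORDER BY a.Value
""").fetchall()
print(segments)  # [(4, 3), (5, 1)]
```

Segmenting on a second property would mean one more join to the matching attribute table and one more column in the GROUP BY.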

3) The EVENTTYPE table can serve the purpose of consistency, i.e. you could have:

table EVENTS (.... EVENTTYPE_ID varchar - foreign key to EVENTTYPE ...)
table EVENTTYPE (EVENTTYPE_ID varchar)

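Or, you could use a numeric ID and keep the event name in the EVENTTYPE table. This saves space and makes it easy to rename events, but you will need to join this table in every query (resulting in slightly slower queries). It depends on whether you prioritize saving storage space or lower query time and simplicity.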
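The numeric-ID variant could look roughly like the sketch below (Python's sqlite3 as a stand-in for MySQL). The `NAME` and `TS` columns and the `count_events` helper are assumed names for illustration. It shows both sides of the trade-off: every lookup by name needs the join, but a rename touches a single row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE EVENTTYPE (
    EVENTTYPE_ID INTEGER PRIMARY KEY,
    NAME         VARCHAR(64) NOT NULL UNIQUE
);
CREATE TABLE EVENTS (
    ID           INTEGER PRIMARY KEY,
    EVENTTYPE_ID INTEGER NOT NULL REFERENCES EVENTTYPE(EVENTTYPE_ID),
    TS           TEXT    NOT NULL
);
INSERT INTO EVENTTYPE (EVENTTYPE_ID, NAME) VALUES (1, 'APP_LAUNCH');
INSERT INTO EVENTS (EVENTTYPE_ID, TS) VALUES (1, '2013-11-05 10:00:00');
INSERT INTO EVENTS (EVENTTYPE_ID, TS) VALUES (1, '2013-11-06 09:30:00');
""")


def count_events(name):
    """Count events by type name; the join is the per-query cost of numeric IDs."""
    return conn.execute(
        """
        SELECT COUNT(*)
        FROM EVENTS e
        JOIN EVENTTYPE t ON t.EVENTTYPE_ID = e.EVENTTYPE_ID
        WHERE t.NAME = ?
        """,
        (name,),
    ).fetchone()[0]


before = count_events("APP_LAUNCH")
# Renaming updates one EVENTTYPE row; no EVENTS rows change.
conn.execute(
    "UPDATE EVENTTYPE SET NAME = 'APPLICATION_LAUNCH' WHERE EVENTTYPE_ID = 1"
)
after = count_events("APPLICATION_LAUNCH")
```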

4) Timestamp range queries are actually very simple in your design:

select *
from EVENTS
where EVENTTYPE_ID = 'APP_LAUNCH' and TIMESTAMP > '2013-11-01'

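For the "number of APP_LAUNCHES between time x and y" case from the question, a bounded count might look like the following sketch (Python's sqlite3 as a stand-in for MySQL). The column is named `TS` here, the composite index and the `count_between` helper are assumptions, and timestamps are stored as ISO-8601 text, which sorts chronologically; in MySQL a DATETIME or TIMESTAMP column would be the natural choice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE EVENTS (
    ID           INTEGER PRIMARY KEY,
    EVENTTYPE_ID VARCHAR(64) NOT NULL,
    TS           TEXT        NOT NULL
);
-- Composite index: equality column first, then the range column.
CREATE INDEX idx_events_type_ts ON EVENTS (EVENTTYPE_ID, TS);
""")
conn.executemany(
    "INSERT INTO EVENTS (EVENTTYPE_ID, TS) VALUES (?, ?)",
    [
        ("APP_LAUNCH", "2013-10-31 23:59:59"),
        ("APP_LAUNCH", "2013-11-01 00:00:00"),
        ("APP_LAUNCH", "2013-11-15 12:00:00"),
        ("APP_LAUNCH", "2013-12-01 00:00:00"),
        ("APP_CLOSE", "2013-11-15 12:05:00"),
    ],
)


def count_between(event_type, start, end):
    """Count events of one type with start <= TS < end (half-open range)."""
    return conn.execute(
        "SELECT COUNT(*) FROM EVENTS "
        "WHERE EVENTTYPE_ID = ? AND TS >= ? AND TS < ?",
        (event_type, start, end),
    ).fetchone()[0]


november_launches = count_between(
    "APP_LAUNCH", "2013-11-01 00:00:00", "2013-12-01 00:00:00"
)
print(november_launches)  # 2
```

The half-open range avoids counting an event on the boundary in two adjacent periods.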

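5) "Is it bad to insert a row for every event?"

This depends entirely on you! If you need the timestamp and/or the individual parameters of every such event, then you probably should have a row for each event. If there is a huge number of events with the same type and parameters, you can do what most logging systems do: aggregate the occurrences into a single row. If your gut feeling points that way, then it is probably the way to go.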
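Aggregating on input could be sketched as a counter table keyed by event type, property, value and a time bucket, with each `track` call incrementing the matching row. Everything named here (the `EVENT_COUNTS` table, the hourly bucket, the `ts` argument) is an assumption for illustration, and SQLite via Python's sqlite3 stands in for MySQL. In MySQL the insert-then-update pair could be collapsed into a single `INSERT ... ON DUPLICATE KEY UPDATE`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE EVENT_COUNTS (
        EVENTTYPE_ID VARCHAR(64)  NOT NULL,
        ATTR_NAME    VARCHAR(64)  NOT NULL,
        ATTR_VALUE   VARCHAR(255) NOT NULL,
        BUCKET       TEXT         NOT NULL,  -- start of the hour
        CNT          INTEGER      NOT NULL DEFAULT 0,
        PRIMARY KEY (EVENTTYPE_ID, ATTR_NAME, ATTR_VALUE, BUCKET)
    )
""")


def track(event_type, count, segmentation, ts):
    """Add `count` occurrences to the hourly counter of each property."""
    bucket = ts[:13] + ":00:00"  # truncate 'YYYY-MM-DD HH:MM:SS' to the hour
    for name, value in segmentation.items():
        key = (event_type, name, str(value), bucket)
        # Create the counter row if it is missing, then increment it.
        conn.execute(
            "INSERT OR IGNORE INTO EVENT_COUNTS "
            "(EVENTTYPE_ID, ATTR_NAME, ATTR_VALUE, BUCKET) VALUES (?, ?, ?, ?)",
            key,
        )
        conn.execute(
            "UPDATE EVENT_COUNTS SET CNT = CNT + ? "
            "WHERE EVENTTYPE_ID = ? AND ATTR_NAME = ? "
            "AND ATTR_VALUE = ? AND BUCKET = ?",
            (count,) + key,
        )


track("APP_LAUNCH", 4, {"APP_VERSION": 1.0}, "2013-11-05 10:12:00")
track("APP_LAUNCH", 2, {"APP_VERSION": 1.0}, "2013-11-05 10:47:30")
track("APP_LAUNCH", 1, {"APP_VERSION": 1.0}, "2013-11-05 11:03:00")

counts = conn.execute(
    "SELECT BUCKET, CNT FROM EVENT_COUNTS ORDER BY BUCKET"
).fetchall()
print(counts)  # [('2013-11-05 10:00:00', 6), ('2013-11-05 11:00:00', 1)]
```

The table stays small no matter how many calls arrive, but per-event timestamps are lost, and counters kept per single property cannot answer questions about combinations of properties.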

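6) "I don't have enough experience with SQL to know how these queries perform over possibly hundreds of thousands of these entries"

Hundreds of thousands of such entries will be handled without problems. When you reach a million, you will have to think much more about efficiency.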

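7) "Would an aggregate table or an in-memory cache help to alleviate problems when I want the client to actually get the analytics?"

Of course, this is also a solution if the queries become slow and you need fast responses. But then you have to introduce some mechanism to refresh the cache periodically. That is more complicated; it may be better to consider aggregating the events on input, see 5).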
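To make the refresh mechanism concrete, here is one possible shape for it: a summary table rebuilt from the raw events by a job that runs on a schedule. This is only a sketch with assumed names (`DAILY_COUNTS`, `refresh_daily_counts`), using Python's sqlite3 as a stand-in for MySQL; the scheduling itself (cron, a worker, and so on) is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE EVENTS (
    ID           INTEGER PRIMARY KEY,
    EVENTTYPE_ID TEXT NOT NULL,
    TS           TEXT NOT NULL
);
CREATE TABLE DAILY_COUNTS (
    EVENTTYPE_ID TEXT    NOT NULL,
    DAY          TEXT    NOT NULL,
    CNT          INTEGER NOT NULL,
    PRIMARY KEY (EVENTTYPE_ID, DAY)
);
""")


def refresh_daily_counts():
    """Rebuild the summary from raw events inside one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM DAILY_COUNTS")
        conn.execute("""
            INSERT INTO DAILY_COUNTS (EVENTTYPE_ID, DAY, CNT)
            SELECT EVENTTYPE_ID, substr(TS, 1, 10), COUNT(*)
            FROM EVENTS
            GROUP BY EVENTTYPE_ID, substr(TS, 1, 10)
        """)


def daily_counts():
    """Read the cached counts; cheap regardless of how many raw events exist."""
    return conn.execute(
        "SELECT EVENTTYPE_ID, DAY, CNT FROM DAILY_COUNTS ORDER BY DAY"
    ).fetchall()


conn.executemany(
    "INSERT INTO EVENTS (EVENTTYPE_ID, TS) VALUES (?, ?)",
    [
        ("APP_LAUNCH", "2013-11-05 10:00:00"),
        ("APP_LAUNCH", "2013-11-05 18:30:00"),
        ("APP_LAUNCH", "2013-11-06 09:00:00"),
    ],
)
refresh_daily_counts()
first = daily_counts()

# A new event is invisible to readers of the summary until the next refresh.
conn.execute(
    "INSERT INTO EVENTS (EVENTTYPE_ID, TS) VALUES ('APP_LAUNCH', '2013-11-06 12:00:00')"
)
stale = daily_counts()
refresh_daily_counts()
fresh = daily_counts()
```

The staleness window between refreshes is the extra complexity referred to above; aggregating on input avoids it at the cost of losing the raw rows.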

Archived:	12 years, 5 months ago
Viewed:	6534 times
Last activity:	11 years, 4 months ago