Joe*_*ate 6 hadoop mapreduce range apache-pig
I have a dataset, A, with a timestamp, a visitor, and a URL:
(2012-07-21T14:00:00.000Z, joe, hxxp:///www.aaa.com)
(2012-07-21T14:01:00.000Z, mary, hxxp://www.bbb.com)
(2012-07-21T14:02:00.000Z, joe, hxxp:///www.aaa.com)
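I want to count the visits per user per URL within a time window (say, 10 minutes), but as a rolling window that advances one minute at a time. The output would be: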
(2012-07-21T14:00 to 2012-07-21T14:10, joe, hxxp://www.aaa.com, 2)
(2012-07-21T14:01 to 2012-07-21T14:11, joe, hxxp://www.aaa.com, 1)
To simplify the arithmetic, I converted the timestamp to the minute of the day, like this:
(840, joe, hxxp://www.aaa.com) /* 840 = 14 hrs x 60 + 00 mins */
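For reference, the conversion can be sketched in a few lines of Python. This is only an illustration: the helper name `minute_of_day` is made up here, and it assumes every timestamp has exactly the format shown in the sample rows. Seconds are dropped rather than rounded.

```python
from datetime import datetime


def minute_of_day(ts):
    """Return minutes since midnight for a timestamp like 2012-07-21T14:00:00.000Z."""
    parsed = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")
    # Seconds and milliseconds are ignored, so 14:00:59 still maps to 840.
    return parsed.hour * 60 + parsed.minute


print(minute_of_day("2012-07-21T14:00:00.000Z"))  # 840
print(minute_of_day("2012-07-21T14:02:00.000Z"))  # 842
```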
To iterate over 'A' with a moving time window, I created a dataset B containing the minutes of the day:
(0)
(1)
(2)
.
.
.
.
(1440)
Ideally, I'd like to do something like the following:
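-- Pseudocode only: Pig does not accept GROUP inside a nested FOREACH.
A = LOAD 'dataset1' AS (ts, visitor, uri);
B = LOAD 'dataset2' AS (minute);
FOREACH B {
    -- ">=" so that a visit at 14:00 falls in the 14:00 to 14:10 window
    C = FILTER A BY ts >= minute AND ts < minute + 10;
    D = GROUP C BY (visitor, uri);
    FOREACH D GENERATE group, COUNT(C) AS mycnt;
}
DUMP B;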
I know that 'GROUP' isn't allowed inside a 'FOREACH' loop, but is there a workaround that achieves the same result?
Thanks!
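Maybe you can do something like this?
Note: this depends on the minutes you create for the logs being integers. If they aren't, you can round to the nearest minute.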
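#!/usr/bin/python
# Returns a bag with one tuple per minute in [start, end).
@outputSchema('expanded: {(num:int)}')
def expand(start, end):
    # (x,) is a one-element tuple; a bare (x) would just be the integer x.
    return [ (x,) for x in range(start, end) ]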
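To see what the UDF produces without running Pig, here is a plain-Python version of the same function. It leaves out the `@outputSchema` decorator, which I'm assuming is only available once the script is registered with Pig, and it returns one-element tuples on the assumption that each bag element should be a tuple.

```python
def expand(start, end):
    # One single-field tuple per minute in the half-open range [start, end).
    return [(x,) for x in range(start, end)]


print(expand(840, 843))  # [(840,), (841,), (842,)]
```

So `expand(minute, minute + 10)` gives the ten minutes covered by the window that starts at `minute`.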
register 'myudf.py' using jython as myudf ;
-- A1 is the minutes. Schema:
-- A1: {minute: int}
-- A2 is the logs. Schema:
-- A2: {minute: int,name: chararray}
-- These schemas should change to fit your needs.
B = FOREACH A1 GENERATE minute,
                        FLATTEN(myudf.expand(minute, minute+10)) AS matchto ;
-- B is in the form:
-- 1 1
-- 1 2
-- ....
-- 2 2
-- 2 3
-- ....
-- 100 100
-- 100 101
-- etc.
-- Now we join on the minute in the second column of B with the
-- minute in the log, then it is just grouping by the minute in
-- the first column and name and counting
C = JOIN B BY matchto, A2 BY minute ;
D = FOREACH (GROUP C BY (B::minute, name))
            GENERATE FLATTEN(group), COUNT(C) as count ;
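If it helps to check the logic before running it on a cluster, the expand, join, and group steps can be mimicked in plain Python. This is only a rough sketch on the three sample rows from the question: the variable names are made up, the URL column is left out to match the A2 schema above, and only a handful of window starts are used to keep it small.

```python
from collections import Counter

WINDOW = 10  # window length in minutes, as in the question

# A2: the sample logs as (minute, name) rows.
logs = [(840, "joe"), (841, "mary"), (842, "joe")]

# A1: window start minutes (a small range instead of the whole day).
window_starts = range(838, 844)

# B: (window_start, matchto) pairs, like FLATTEN(myudf.expand(minute, minute+10)).
b = [(start, m) for start in window_starts for m in range(start, start + WINDOW)]

# C: inner join of B.matchto with A2.minute.
c = [(start, name) for (start, matchto) in b
     for (minute, name) in logs if matchto == minute]

# D: group by (window_start, name) and count the joined rows.
d = Counter(c)

for (start, name), count in sorted(d.items()):
    print(start, name, count)
```

For joe this gives 2 for the window starting at 840 and 1 for the window starting at 841, which matches the output the question asks for.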
I'm a little worried about the speed on larger logs, but it should work. Let me know if you need me to explain anything.
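On the speed concern, one possible variation (not what the script above does) is to skip the join and instead expand each log row into the window starts that contain it, then group and count. The sketch below shows the idea in plain Python; `window_counts` is a hypothetical name, and whether this is actually faster in Pig would depend on the data, since every log row is still multiplied by the window length.

```python
from collections import Counter

WINDOW = 10  # window length in minutes

logs = [(840, "joe"), (841, "mary"), (842, "joe")]


def window_counts(rows, window=WINDOW):
    counts = Counter()
    for minute, name in rows:
        # A visit at `minute` falls in the windows starting at
        # minute - window + 1 through minute (clamped at minute 0).
        for start in range(max(0, minute - window + 1), minute + 1):
            counts[(start, name)] += 1
    return counts


counts = window_counts(logs)
print(counts[(840, "joe")], counts[(841, "joe")])  # 2 1
```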