Finding the mean using Pig or Hadoop

Moh*_*hit 7 hadoop apache-pig

I have a huge text file.

The data is stored in a directory as data/data1.txt, data2.txt, and so on:

merchant_id, user_id, amount
1234, 9123, 299.2
1233, 9199, 203.2
 1234, 0124, 230
 and so on..

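What I want to do is, for each merchant, find the average amount.

So basically, in the end I want to save the output in a file, something like: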

 merchant_id, average_amount
  1234, avg_amt_1234
  and so on.

How can I also calculate the standard deviation?

Sorry for asking such a basic question. :( Any help would be appreciated. :)

ale*_*pab 13

Apache Pig is well suited to this kind of task. See the example:

inpt = load '~/pig_data/pig_fun/input/group.txt' as (amnt:double, id:chararray,c2:chararray);
grp = group inpt by id;
mean = foreach grp {
    sum = SUM(inpt.amnt);
    count = COUNT(inpt);
    generate group as id, sum/count as mean, sum as sum, count as count;
};

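Pay particular attention to the data type of the amnt column, because it determines which implementation of the SUM function Pig will call.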
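Applied to the layout shown in the question, the same idea might look like the sketch below. It assumes comma-separated files under a data/ directory with the columns in the order merchant_id, user_id, amount, and no stray whitespace around the fields (otherwise the built-in TRIM could be applied to merchant_id before grouping). The aliases raw, rows, and avgs and the output path output/merchant_avg are made up for illustration. It uses the built-in AVG instead of dividing SUM by COUNT.

```pig
-- Assumption: comma-delimited input, columns in the order given in the question.
raw = load 'data' using PigStorage(',') as (merchant_id:chararray, user_id:chararray, amount:double);
-- Drop the header row, in case the files contain one.
rows = filter raw by merchant_id != 'merchant_id';
grp = group rows by merchant_id;
-- One output row per merchant with its mean amount.
avgs = foreach grp generate group as merchant_id, AVG(rows.amount) as average_amount;
-- Hypothetical output location; Pig writes a directory of part files.
store avgs into 'output/merchant_avg' using PigStorage(',');
```

Declaring amount as double in the load schema is what lets AVG (and SUM) pick the double implementation.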

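Pig can also do something that plain SQL (without window functions) cannot: it can put the mean against each input row without using any inner joins. This is very useful if you are calculating z-scores using the standard deviation.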

mean = foreach grp {
    sum = SUM(inpt.amnt);
    count = COUNT(inpt);
    generate FLATTEN(inpt), sum/count as mean, sum as sum, count as count;
};

FLATTEN(inpt) does the trick: now you have access to the original amount that contributed to the group's mean, sum, and count.

Update 1:

Calculating variance and standard deviation:

inpt = load '~/pig_data/pig_fun/input/group.txt' as (amnt:double, id:chararray, c2:chararray);
grp = group inpt by id;
-- attach the group mean and count to every input row
mean = foreach grp {
    sum = SUM(inpt.amnt);
    count = COUNT(inpt);
    generate flatten(inpt), sum/count as avg, count as count;
};
-- squared deviation of each row from its group mean
tmp = foreach mean {
    dif = (amnt - avg) * (amnt - avg);
    generate *, dif as dif;
};
-- second grouping, to sum the squared deviations per id
grp = group tmp by id;
standard_tmp = foreach grp generate flatten(tmp), SUM(tmp.dif) as sqr_sum;
-- population variance and standard deviation (divides by count, not count - 1)
standard = foreach standard_tmp generate *, sqr_sum / count as variance, SQRT(sqr_sum / count) as standard;

This will use 2 jobs. I have not figured out yet how to do it in one; I need to spend more time on it.
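One possible way to get down to a single grouping is to rewrite the population variance as E[x^2] - (E[x])^2, so that the count, the sum, and the sum of squares can all be accumulated in one pass per id. The following is only a sketch under that assumption, reusing the input schema from above; the aliases sq, sums, and stats and the field names n, total, and total_sq are made up for illustration. With only one GROUP it should need a single MapReduce job, but it returns one row per id rather than one row per input record.

```pig
-- Assumption: same input file and schema as the script above.
inpt = load '~/pig_data/pig_fun/input/group.txt' as (amnt:double, id:chararray, c2:chararray);
-- Precompute the square so that one GROUP can sum both amnt and amnt^2.
sq = foreach inpt generate id, amnt, amnt * amnt as amnt_sq;
grp = group sq by id;
-- Per-id count, sum, and sum of squares; the count is cast to double for the divisions below.
sums = foreach grp generate group as id, (double) COUNT(sq) as n, SUM(sq.amnt) as total, SUM(sq.amnt_sq) as total_sq;
-- Population variance as E[x^2] - (E[x])^2, and its square root.
stats = foreach sums generate id, total / n as avg,
    total_sq / n - (total / n) * (total / n) as variance,
    SQRT(total_sq / n - (total / n) * (total / n)) as standard;
```

This form subtracts two potentially large numbers, so it can lose precision, or even come out slightly negative, when the mean is large compared with the spread of the amounts. In that case the two-step version above is the safer choice.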