I have a huge text file.
The data is stored in a directory as data/data1.txt, data2.txt, and so on:
merchant_id, user_id, amount
1234, 9123, 299.2
1233, 9199, 203.2
1234, 0124, 230
and so on..
What I want to do is find the average amount for each merchant.
So basically I want to end up with the output saved in a file, something like:
merchant_id, average_amount
1234, avg_amt_1234
and so on.
How do I also calculate the standard deviation?
Sorry for asking such a basic question. :( Any help would be appreciated. :)
ale*_*pab 13
Apache Pig is well suited to this kind of task. See the example:
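-- Load the input; amnt is declared as double so that SUM uses the double implementation.
inpt = load '~/pig_data/pig_fun/input/group.txt' as (amnt:double, id:chararray, c2:chararray);
-- One bag of rows per id.
grp = group inpt by id;
mean = foreach grp {
    sum = SUM(inpt.amnt);
    count = COUNT(inpt);
    -- One output row per id: its mean, sum, and count.
    generate group as id, sum/count as mean, sum as sum, count as count;
};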
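Pay special attention to the data type of the amnt column, because it determines which implementation of the SUM function Pig will invoke.
Pig can also do something that is awkward in basic SQL (without window functions): it can attach the group mean to every input row without any inner join. This is very useful if you are computing z-scores from the standard deviation.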
mean = foreach grp {
    sum = SUM(inpt.amnt);
    count = COUNT(inpt);
    generate FLATTEN(inpt), sum/count as mean, sum as sum, count as count;
};
FLATTEN(inpt) does the trick: you now have access to the original amounts that contributed to the group mean, sum, and count.
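As a rough illustration of what the FLATTEN version produces, here is a minimal Python sketch (not Pig, and not part of the original answer) that attaches each group's mean, sum, and count to every input row. The function name `attach_group_stats` and the `(id, amnt)` tuple layout are assumptions made for this example only.

```python
from collections import defaultdict


def attach_group_stats(rows):
    """Return every (id, amnt) row extended with its group's mean, sum, and count.

    Hypothetical helper: it mimics the shape of the FLATTEN output,
    one output row per input row, with the group statistics repeated.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    # First pass: accumulate the per-group sum and count.
    for key, amount in rows:
        totals[key] += amount
        counts[key] += 1
    # Second pass: emit each original row together with its group's statistics.
    return [
        (key, amount, totals[key] / counts[key], totals[key], counts[key])
        for key, amount in rows
    ]


if __name__ == "__main__":
    # Sample values borrowed from the question's example data.
    sample = [("1234", 299.2), ("1233", 203.2), ("1234", 230.0)]
    for row in attach_group_stats(sample):
        print(row)
```

Each output row keeps the original amount next to its group mean, which is the shape needed to compute a deviation or z-score per row.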
Update 1:
inpt = load '~/pig_data/pig_fun/input/group.txt' as (amnt:double, id:chararray, c2:chararray);
grp = group inpt by id;
-- Attach the group average and count to every input row.
mean = foreach grp {
    sum = SUM(inpt.amnt);
    count = COUNT(inpt);
    generate flatten(inpt), sum/count as avg, count as count;
};
-- Squared deviation of each amount from its group average.
tmp = foreach mean {
    dif = (amnt - avg) * (amnt - avg);
    generate *, dif as dif;
};
-- Regroup by id to sum the squared deviations.
grp = group tmp by id;
standard_tmp = foreach grp generate flatten(tmp), SUM(tmp.dif) as sqr_sum;
-- Population variance and standard deviation (divides by count, not count - 1).
standard = foreach standard_tmp generate *, sqr_sum / count as variance, SQRT(sqr_sum / count) as standard;
This will use 2 jobs. I have not figured out how to do it in one yet; I need to spend more time on it.
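One way to avoid the second grouping, offered here only as a sketch and not as the original author's method, is the identity variance = mean(x^2) - mean(x)^2. It needs only a count, a sum, and a sum of squares per merchant, so a single pass over the data is enough. The Python below illustrates the arithmetic. The name `merchant_stats` and the `(merchant_id, amount)` row layout are assumptions for this example. Like the script above, it divides by the count, so it gives the population standard deviation.

```python
import math
from collections import defaultdict


def merchant_stats(rows):
    """One pass over (merchant_id, amount) rows -> {merchant_id: (mean, std)}."""
    # Per merchant: [count, sum, sum of squares].
    acc = defaultdict(lambda: [0, 0.0, 0.0])
    for merchant_id, amount in rows:
        entry = acc[merchant_id]
        entry[0] += 1
        entry[1] += amount
        entry[2] += amount * amount
    stats = {}
    for merchant_id, (n, total, total_sq) in acc.items():
        mean = total / n
        # Clamp at zero: rounding can make the difference slightly negative.
        variance = max(total_sq / n - mean * mean, 0.0)
        stats[merchant_id] = (mean, math.sqrt(variance))
    return stats


if __name__ == "__main__":
    # Sample values borrowed from the question's example data.
    sample = [("1234", 299.2), ("1233", 203.2), ("1234", 230.0)]
    for merchant_id, (mean, std) in merchant_stats(sample).items():
        print(merchant_id, mean, std)
```

The trade-off is numerical: subtracting two large, nearly equal numbers can lose precision when amounts are large relative to their spread, so the two-step version above remains the safer choice for such data.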