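I have a skewed data set, and I need to do a GROUP BY operation on it followed by a nested FOREACH. Because of the skew, a few reducers take a very long time while the others finish almost immediately. I know skewed join exists, but is there anything equivalent for GROUP BY and FOREACH? Here is my Pig code (variables renamed):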
foo_grouped = GROUP foo_grouped BY FOO;
FOO_stats = FOREACH foo_grouped
{
    a_FOO_total = foo_grouped.ATTR;
    a_FOO_total = DISTINCT a_FOO_total;
    bar_count = foo_grouped.BAR;
    bar_count = DISTINCT bar_count;
    a_FOO_type1 = FILTER foo_grouped BY COND1 == 'Y';
    a_FOO_type1 = a_FOO_type1.ATTR;
    a_FOO_type1 = DISTINCT a_FOO_type1;
    a_FOO_type2 = FILTER foo_grouped BY COND2 == 'Y' OR COND3 == 'HIGH';
    a_FOO_type2 = a_FOO_type2.ATTR;
    a_FOO_type2 = DISTINCT a_FOO_type2;
    GENERATE group AS FOO,
        COUNT(a_FOO_total) AS a_FOO_total,
        COUNT(a_FOO_type1) AS a_FOO_type1,
        COUNT(a_FOO_type2) AS a_FOO_type2,
        COUNT(bar_count) AS bar_count;
}