PIG: Get all tuples out of a grouped bag

Voj*_*ěch 5 apache-pig

I'm using PIG to generate groups from tuples, as follows:

a1, b1
a1, b2
a1, b3
...

->

a1, [b1, b2, b3]
...

That part is easy and works. My problem is the next step: from each resulting group, I want to generate all pairs of tuples in the group's bag:

a1, [b1, b2, b3]

->

b1,b2
b1,b3
b2,b3

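This would be easy if I could nest "foreach" statements, iterating first over each group and then over its bag.

I suspect I have misunderstood the concept, and I would really appreciate an explanation.

Thanks.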

ale*_*pab 15

It looks like you need a Cartesian product of the bag with itself. To get it, use FLATTEN(bag) twice in the same generate.

Code:

inpt = load '.../group.txt' using PigStorage(',') as (id, val);
grp = group inpt by (id);
id_grp = foreach grp generate group as id, inpt.val as value_bag;
result = foreach id_grp generate id, FLATTEN(value_bag) as v1, FLATTEN(value_bag) as v2; 
dump result;

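Note that large bags will produce a lot of rows: a bag of n values yields n² rows per group. To keep that in check, you can apply TOP(...) before FLATTEN: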

inpt = load '....group.txt' using PigStorage(',')  as (id, val);
grp = group inpt by (id);
id_grp = foreach grp generate group as id, inpt.val as values;
result = foreach id_grp {
    limited_bag = TOP(50, 0, values); -- all sorts of filtering could be done here
    generate id, FLATTEN(limited_bag) as v1, FLATTEN(limited_bag) as v2; 
};
dump result;

For your specific output, you can do some filtering before the FLATTEN:

inpt = load '..../group.txt' as (id, val);
grp = group inpt by (id);
id_grp = foreach grp generate group as id, inpt.val as values;
result = foreach id_grp {
    l = filter values by val == 'b1' or val == 'b2';
    generate id, FLATTEN(l) as v1, FLATTEN(values) as v2; 
};
result = filter result by v1 != v2;
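If you would rather not hard-code 'b1' and 'b2', one possible variation is to build the full self-product and then keep only the rows where v1 sorts before v2, so each unordered pair shows up once. This is only a sketch under a few assumptions: both columns are declared as chararray (my addition, so the comparison is a plain string comparison), the values within a group are distinct, and the aliases pairs and unique_pairs are names I made up for illustration. The load path is elided exactly as in the snippets above.

```pig
-- Sketch: assumes comma-separated input with two string columns.
inpt = load '.../group.txt' using PigStorage(',') as (id:chararray, val:chararray);
grp = group inpt by id;
id_grp = foreach grp generate group as id, inpt.val as value_bag;

-- Cross the bag with itself, as in the first snippet.
pairs = foreach id_grp generate id, FLATTEN(value_bag) as v1, FLATTEN(value_bag) as v2;

-- Keep only rows with v1 < v2: this drops (x, x) and one of (x, y) / (y, x).
unique_pairs = filter pairs by v1 < v2;
dump unique_pairs;
```

For a group a1 whose bag holds b1, b2 and b3, this should leave the three pairs from the question (b1,b2), (b1,b3) and (b2,b3), each prefixed with the id, in no guaranteed order. The full product is still materialized before the filter runs, so the warning about large bags applies here too.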

I hope this helps.

Cheers