Suppose I have a dataset of restaurant reviews:
User,City,Restaurant,Rating
Jim,New York,Mecurials,3
Jim,New York,Whapme,4.5
Jim,London,Pint Size,2
Lisa,London,Pint Size,4
Lisa,London,Rabbit Whole,3.5
I want to produce a list of average ratings by user and city, i.e. the output:
User,City,AverageRating
Jim,New York,3.75
Jim,London,2
Lisa,London,3.75
I can write a Pig script like this:
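-- Load the comma-separated reviews with an explicit schema.
Data = LOAD 'data.txt' USING PigStorage(',') AS (
    user:chararray, city:chararray, restaurant:chararray, rating:float
);
-- Group by the (user, city) pair and average the ratings in each group.
PerUserCity = GROUP Data BY (user, city);
ResultSet = FOREACH PerUserCity {
    GENERATE group.user, group.city, AVG(Data.rating);
}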
But I'm curious whether I can group at the higher level (user) first and then group again at the next level (city), i.e.:
PerUser = GROUP Data BY user;
Intermediate = FOREACH PerUser {
B = GROUP Data BY city;
GENERATE group AS user, B;
}
I get:
Error during parsing.
Invalid alias: GROUP in {
group: chararray,
Data: {
user: chararray,
city: chararray,
restaurant: chararray,
rating: float
}
}
Has anyone tried this successfully? Is it simply not possible to GROUP inside a FOREACH?
My goal is to do something like:
ResultSet = FOREACH PerUser {
FOREACH City {
GENERATE user, city, AVG(City.rating)
}
}
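
For context, and as far as I understand it, the nested block of a FOREACH accepts only a limited set of operators (such as DISTINCT, FILTER, LIMIT and ORDER BY), and GROUP is not among them, which would be consistent with the parse error above. A minimal sketch of one way to still end up with a per-user nesting, assuming the Data relation loaded earlier: compute the per-(user, city) averages with a flat GROUP first, then group those rows by user. The aliases CityAvg, ByUser, Nested, avg_rating and cities are names introduced here for illustration only.

```pig
-- Step 1: one row per (user, city) with its average rating.
PerUserCity = GROUP Data BY (user, city);
CityAvg = FOREACH PerUserCity GENERATE
    group.user AS user,
    group.city AS city,
    AVG(Data.rating) AS avg_rating;

-- Step 2: regroup by user, so each user carries a bag of
-- (city, avg_rating) tuples.
ByUser = GROUP CityAvg BY user;
Nested = FOREACH ByUser GENERATE
    group AS user,
    CityAvg.(city, avg_rating) AS cities;

DUMP Nested;
```

If only the flat User,City,AverageRating list is needed, CityAvg from step 1 already has that shape and step 2 can be dropped.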