I need to write a Pig script that computes the averages of several columns and keeps only the rows in which every column value is greater than the corresponding average. My script is:
i2 = GROUP i1 all;
i3 = FOREACH i2 GENERATE
    AVG(i1.user_followers_count) AS avg_user_followers_count,
    AVG(i1.avl_user_follower_following_ratio) AS avg_avl_user_follower_following_ratio,
    AVG(i1.user_total_liked) AS avg_user_total_liked,
    AVG(i1.user_total_posts) AS avg_user_total_posts,
    AVG(i1.user_total_public_lists) AS avg_user_total_public_lists,
    AVG(i1.avl_user_total_retweets) AS avg_avl_user_total_retweets,
    AVG(i1.avl_user_total_likes) AS avl_user_total_likes,
    AVG(i1.avl_user_total_replies) AS avg_avl_user_total_replies,
    AVG(i1.avl_user_engagements) AS avl_avl_user_engagements,
    AVG(i1.user_reply_to_reply_count) AS avg_user_reply_to_reply_count;
top_inf = FILTER i1 BY (i1.user_followers_count > i3.avg_user_followers_count, i1.avl_user_total_retweets > i3. avg_avl_user_total_retweets, i1.avl_user_total_likes > i3.avg_avl_user_total_retweets);
But this throws an error:
ERROR 1200: <file user.pig, line 70, column 103> mismatched input '>' expecting RIGHT_PAREN
What is the right way to filter rows on multiple conditions?
Separate the conditions with AND:
top_inf = FILTER i1 BY (i1.user_followers_count > i3.avg_user_followers_count)
AND (i1.avl_user_total_retweets > i3.avg_avl_user_total_retweets)
AND (i1.avl_user_total_likes > i3.avg_avl_user_total_retweets);
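For a fuller picture, here is a minimal end-to-end sketch of the same idea, trimmed to three of the columns. The input path `users.csv`, the `PigStorage(',')` loader, and the `double` column types are assumptions made for illustration, not taken from the question. Two other things differ from the snippet above and may be worth checking against your script. First, inside `FILTER i1 BY` the fields of `i1` are written by bare name, because a `relation.field` reference in an expression is treated by Pig as a scalar lookup, which only works for a single-row relation such as `i3`. Second, the likes column is compared with the likes average (aliased here as `avg_avl_user_total_likes`, a renamed alias) instead of the retweets average, on the assumption that this is what was intended.

```pig
-- Hypothetical input path and schema; only three of the question's columns are shown.
i1 = LOAD 'users.csv' USING PigStorage(',')
     AS (user_followers_count:double,
         avl_user_total_retweets:double,
         avl_user_total_likes:double);

-- GROUP ... ALL puts every row into one group, so i3 is a one-row relation of averages.
i2 = GROUP i1 ALL;
i3 = FOREACH i2 GENERATE
         AVG(i1.user_followers_count)    AS avg_user_followers_count,
         AVG(i1.avl_user_total_retweets) AS avg_avl_user_total_retweets,
         AVG(i1.avl_user_total_likes)    AS avg_avl_user_total_likes;

-- Bare names refer to fields of i1; i3.<field> is a scalar lookup into the one-row relation.
-- Conditions are combined with AND, not separated by commas.
top_inf = FILTER i1 BY (user_followers_count    > i3.avg_user_followers_count)
                   AND (avl_user_total_retweets > i3.avg_avl_user_total_retweets)
                   AND (avl_user_total_likes    > i3.avg_avl_user_total_likes);

DUMP top_inf;
```

Use `OR` in place of `AND` if a row should be kept when any one of the conditions holds.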