I am trying to write a Pig Latin script that gets the count of a dataset I have already filtered.
This is the script so far:
/* scans by title */
scans = LOAD '/hive/scans/*' USING PigStorage(',') AS (thetime:long,product_id:long,lat:double,lon:double,user:chararray,category:chararray,title:chararray);
productscans = FILTER scans BY (title MATCHES 'proactiv');
scancount = FOREACH productscans GENERATE COUNT($0);
DUMP scancount;
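For some reason, I get this error:
Could not infer the matching function for org.apache.pig.builtin.COUNT as multiple or none of them fit. Please use an explicit cast.
What am I doing wrong here? I assume it has something to do with the type of the field I am passing in, but I can't seem to resolve it.
TIA, Jason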
Chr*_*ite 14
This is what you are looking for (GROUP ... ALL puts every record into a single bag, and COUNT then counts the items in that bag):
scans = LOAD '/hive/scans/*' USING PigStorage(',') AS (thetime:long,product_id:long,lat:double,lon:double,user:chararray,category:chararray,title:chararray);
productscans = FILTER scans BY (title MATCHES 'proactiv');
grouped = GROUP productscans ALL;
count = FOREACH grouped GENERATE COUNT(productscans);
dump count;
COUNT requires a preceding GROUP ALL statement for global counts, and a GROUP BY statement for group counts.
You can use either of the following:
scans = LOAD '/hive/scans/*' USING PigStorage(',') AS (thetime:long,product_id:long,lat:double,lon:double,user:chararray,category:chararray,title:chararray);
productscans = FILTER scans BY (title MATCHES 'proactiv');
grouped = GROUP productscans ALL;
count = FOREACH grouped GENERATE COUNT(productscans);
DUMP count;
or
scans = LOAD '/hive/scans/*' USING PigStorage(',') AS (thetime:long,product_id:long,lat:double,lon:double,user:chararray,category:chararray,title:chararray);
productscans = FILTER scans BY (title MATCHES 'proactiv');
grouped = GROUP productscans ALL;
count = FOREACH grouped GENERATE COUNT($1);
DUMP count;
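For the group-count case mentioned above (GROUP BY rather than GROUP ALL), a minimal sketch might look like the following. It reuses the schema from the question; grouping on category and the alias names bycategory, categorycounts, and n are illustrative assumptions, not part of the original script.

```pig
-- Sketch only: per-category counts of the filtered scans.
scans = LOAD '/hive/scans/*' USING PigStorage(',') AS (thetime:long,product_id:long,lat:double,lon:double,user:chararray,category:chararray,title:chararray);
productscans = FILTER scans BY (title MATCHES 'proactiv');
-- GROUP BY yields one tuple per category: (group, {bag of productscans rows})
bycategory = GROUP productscans BY category;
-- COUNT is applied to the bag inside each group, not to a scalar field
categorycounts = FOREACH bycategory GENERATE group AS category, COUNT(productscans) AS n;
DUMP categorycounts;
```

In both variants COUNT receives a bag, which is why the original COUNT($0) on an ungrouped relation failed: $0 there is a long, not a bag. Note that COUNT skips tuples whose first field is null; COUNT_STAR includes them.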