Hive: UDF and GROUP BY

Sri*_*vas 5 hive hiveql

I have a UDF (GetUrlExt) that returns the file extension (for example, jpg from /abc/models/xyz/images/top.jpg). The data looks like this:

Date Time TimeTaken uristem  
9/5/2011 0:00:10 234 /abc/models/xyz/images/top.jpg  
9/5/2011 0:00:11 456 /abc/models/xyz/images/bottom.jpg  
9/5/2011 0:00:14 789 /abc/models/xyz/images/left.gif  
9/5/2011 0:00:16 234 /abc/models/xyz/images/top.pdf  
9/5/2011 0:00:18 734 /abc/models/xyz/images/top.pdf  
9/5/2011 0:00:19 654 /abc/models/xyz/images/right.gif  
9/5/2011 0:00:21 346 /abc/models/xyz/images/top.pdf  
9/5/2011 0:00:24 556 /abc/models/xyz/images/front.pdf  
9/5/2011 0:00:26 134 /abc/models/xyz/images/back.jpg

The query without GROUP BY works fine:

SELECT GetUrlExt(uristem) AS extn FROM LogTable; 

Result: jpg jpg gif pdf pdf gif pdf pdf jpg

Now I need to GROUP BY the result of the GetUrlExt UDF.
Expected result (extension, count, average TimeTaken):
jpg 3 274.6
gif 2 721.5
pdf 4 467.5

But the following query does not work:

SELECT GetUrlExt(uristem) AS extn, Count(*) AS PerCount, Avg(TimeTaken) AS AvgTime FROM LogTable GROUP BY extn;

Any help is appreciated!

pen*_*nsz 7

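Use a subquery for the grouping.

Hive does not let GROUP BY refer to an alias defined in the same SELECT list (extn here), so compute the value in an inner query and group on it in the outer one.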

SELECT a.extn, Count(*) AS PerCount, Avg(TimeTaken) AS AvgTime 
FROM
(
    SELECT GetUrlExt(uristem) AS extn, TimeTaken
    FROM LogTable 
) a
GROUP BY a.extn;
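An alternative that may be worth trying is to repeat the UDF call in the GROUP BY clause instead of referring to the alias. This is only a sketch: it assumes GetUrlExt is deterministic and already registered in the session, and it has not been run against the table from the question.

```sql
-- Group by the UDF expression itself rather than by the alias extn.
-- Assumption: GetUrlExt is already registered in this session
-- (for example with CREATE TEMPORARY FUNCTION).
SELECT GetUrlExt(uristem) AS extn,
       COUNT(*)           AS PerCount,
       AVG(TimeTaken)     AS AvgTime
FROM LogTable
GROUP BY GetUrlExt(uristem);
```

The subquery form keeps the expression in one place, which is easier to maintain if the UDF call later changes; the repeated-expression form avoids the extra nesting.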