I have a table in Hive that looks like this:
col1 col2
b 1
b 2
a 3
b 2
c 4
c 5
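How can I use HiveQL to group the rows by col1, sum col2 within each group, order by that sum, and add a cumulative sum (csum) based on it, so that the result looks like this?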
id sum_all csum
a 3 3
b 5 8
c 9 17
I have only managed the grouping and the sum; I have no idea how to get the cumulative sum. Hive does not support correlated subqueries.
select col1 as id,
       sum(col2) as sum_all
from t
group by col1
order by sum_all
The result is:
id sum_all
a 3
b 5
c 9
Since correlated subqueries are not allowed, try using derived tables and joining them:
select
    a.id,
    a.sum_all,
    sum(b.sum_all) as csum
from
        ( select col1 as id,
                 sum(col2) as sum_all
          from t
          group by col1
        ) a
    join
        ( select col1 as id,
                 sum(col2) as sum_all
          from t
          group by col1
        ) b
    on
        ( b.sum_all < a.sum_all )
     or ( b.sum_all = a.sum_all and b.id <= a.id )
group by
    a.sum_all, a.id
order by
    a.sum_all, a.id ;
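This is essentially a self-join of the derived group-by table. It may be more efficient to save the grouped result in a temporary table first and then self-join that.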
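The temporary-table variant can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for Hive so the logic can be run locally, the table name `grouped` is hypothetical, and the sample rows are the ones from the question. Whether Hive accepts `CREATE TEMPORARY TABLE ... AS SELECT`, or a non-equality condition in the `ON` clause, depends on the Hive version, so check the manual for your release.

```python
import sqlite3

# SQLite stands in for Hive here (an assumption made so the sketch runs locally).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 TEXT, col2 INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("b", 1), ("b", 2), ("a", 3), ("b", 2), ("c", 4), ("c", 5)],
)

# Step 1: materialize the grouped sums once ("grouped" is a hypothetical name).
conn.execute(
    """
    CREATE TEMPORARY TABLE grouped AS
    SELECT col1 AS id, SUM(col2) AS sum_all
    FROM t
    GROUP BY col1
    """
)

# Step 2: self-join the small grouped table. Each row of a is paired with every
# row of b that sorts at or before it by (sum_all, id), and those are summed.
rows = conn.execute(
    """
    SELECT a.id, a.sum_all, SUM(b.sum_all) AS csum
    FROM grouped a
    JOIN grouped b
      ON (b.sum_all < a.sum_all)
      OR (b.sum_all = a.sum_all AND b.id <= a.id)
    GROUP BY a.sum_all, a.id
    ORDER BY a.sum_all, a.id
    """
).fetchall()

for row in rows:
    print(row)
```

With the sample data this yields the three rows of the expected result from the question (a 3 3, b 5 8, c 9 17). The grouping over `t` runs once instead of twice, which is where any saving would come from.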
According to the manual, Hive also has windowed aggregates, so you can use those as well:
select
    a.id,
    a.sum_all,
    sum(a.sum_all) over (order by a.sum_all, a.id
                         rows between unbounded preceding
                                  and current row)
        as csum
from
    ( select col1 as id,
             sum(col2) as sum_all
      from t
      group by col1
    ) a
order by
    sum_all, id ;
Or, without the derived table:
select
    col1 as id,
    sum(col2) as sum_all,
    sum(sum(col2)) over (order by sum(col2), col1
                         rows between unbounded preceding
                                  and current row)
        as csum
from
    t
group by
    col1
order by
    sum_all, id ;
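To make the window frame in the queries above concrete, here is a small Python sketch of the same computation on the sample data from the question. It is not Hive code, and the variable names are hypothetical; it only mirrors, step by step, what the grouping, the ordering, and the `rows between unbounded preceding and current row` frame are meant to compute.

```python
from itertools import accumulate

# Sample rows from the question: (col1, col2).
t = [("b", 1), ("b", 2), ("a", 3), ("b", 2), ("c", 4), ("c", 5)]

# GROUP BY col1 with SUM(col2).
sums = {}
for col1, col2 in t:
    sums[col1] = sums.get(col1, 0) + col2

# ORDER BY sum_all, id: sorting on the pair gives a total order even if two
# groups tie on sum_all.
ordered = sorted(sums.items(), key=lambda kv: (kv[1], kv[0]))

# ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: a running total that
# includes every earlier row plus the current one.
csums = accumulate(sum_all for _, sum_all in ordered)

result = [(id_, sum_all, csum) for (id_, sum_all), csum in zip(ordered, csums)]
for row in result:
    print(row)
```

Ordering by both the sum and the id matters for the same reason as the tie-breaking branch of the self-join condition: it fixes which of two equal sums is counted first.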