Cumulative sum using HiveQL

klx*_*123 5 hive

I have a table in Hive that looks like this:

col1       col2
b           1
b           2
a           3
b           2
c           4
c           5

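How can I use HiveQL to group the rows by col1, sum col2 within each group, order the groups by that sum, and then compute a cumulative sum (csum) over the sums?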

id       sum_all     csum
a         3           3
b         5           8
c         9           17

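I have only managed to come up with the grouping and the sum; I have no idea how to get the cumulative sum. Hive does not support correlated subqueries.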

select col1 as id,
      sum(col2) as sum_all
from t
group by col1
order by sum_all

The result looks like this:

id       sum_all
a         3
b         5
c         9

ype*_*eᵀᴹ 5

Since correlated subqueries are not allowed, try using derived tables and joining them:

select 
    a.id,
    a.sum_all,
    sum(b.sum_all) as csum
from
        ( select col1 as id,
                 sum(col2) as sum_all
          from t
          group by col1
        )  a
    join
        ( select col1 as id,
                 sum(col2) as sum_all
          from t
          group by col1
        )  b
     on
        ( b.sum_all < a.sum_all )
     or ( b.sum_all = a.sum_all and b.id <= a.id )
group by
    a.sum_all, a.id
order by 
    a.sum_all, a.id ;

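This is essentially a self-join of the derived group-by table: each group in a is matched with every group in b that sorts at or before it by (sum_all, id), and those sums are added up. It may be more efficient to save the grouped result into a temporary table first and then self-join that table.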
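The temporary-table idea can be sketched roughly as follows. This is only an illustration under stated assumptions: it runs against SQLite through Python's `sqlite3` module as a stand-in for Hive, the table name `grouped` is made up for the example, and the exact DDL for temporary tables in your Hive version may differ.

```python
import sqlite3

# SQLite in-memory database used as a stand-in for Hive (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 TEXT, col2 INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("b", 1), ("b", 2), ("a", 3), ("b", 2), ("c", 4), ("c", 5)],
)

# Step 1: compute the grouped sums once and keep them in a temporary table.
# `grouped` is a hypothetical name chosen for this sketch.
conn.execute(
    """
    CREATE TEMPORARY TABLE grouped AS
    SELECT col1 AS id, SUM(col2) AS sum_all
    FROM t
    GROUP BY col1
    """
)

# Step 2: self-join the small grouped table. Each row of `a` picks up every
# row of `b` that sorts at or before it by (sum_all, id).
rows = conn.execute(
    """
    SELECT a.id, a.sum_all, SUM(b.sum_all) AS csum
    FROM grouped a
    JOIN grouped b
      ON (b.sum_all < a.sum_all)
      OR (b.sum_all = a.sum_all AND b.id <= a.id)
    GROUP BY a.sum_all, a.id
    ORDER BY a.sum_all, a.id
    """
).fetchall()

for row in rows:
    print(row)
```

On the sample data from the question this yields the rows (a, 3, 3), (b, 5, 8), (c, 9, 17). One caveat: as far as I know, older Hive releases accept only equality conditions in the ON clause, so there the inequality predicate may have to move into a WHERE clause after a cross join.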


According to the manual, Hive also has windowing aggregates, so you can use those as well:

select 
    a.id,
    a.sum_all,
    sum(a.sum_all) over (order by a.sum_all, a.id
                         rows between unbounded preceding
                                  and current row)
        as csum
from
        ( select col1 as id,
                 sum(col2) as sum_all
          from t
          group by col1
        )  a
order by 
    sum_all, id ;

Or, without the derived table, by applying the window function directly to the grouped sum:

select 
    col1 as id,
    sum(col2) as sum_all,
    sum(sum(col2)) over (order by sum(col2), col1
                         rows between unbounded preceding
                                  and current row)
        as csum
from
    t
group by 
    col1
order by 
    sum_all, id ;
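Both window-function queries spell out the frame (rows between unbounded preceding and current row) and include the id in the ordering. A small plain-Python sketch of what that buys you when two groups have the same sum is given here. It only mimics the window semantics and is not Hive code; the extra group d is invented for the illustration, and the comparison assumes that the default frame with an order by is a range frame, which is what the SQL standard and, to my knowledge, the Hive manual describe.

```python
from itertools import accumulate

# Hypothetical grouped result with a tie: b and d both sum to 5.
grouped = [("c", 9), ("b", 5), ("a", 3), ("d", 5)]


def running_total_rows(rows):
    """Mimic: sum(sum_all) over (order by sum_all, id
    rows between unbounded preceding and current row)."""
    ordered = sorted(rows, key=lambda r: (r[1], r[0]))
    totals = accumulate(s for _, s in ordered)
    return [(gid, s, c) for (gid, s), c in zip(ordered, totals)]


def running_total_range(rows):
    """Mimic a RANGE frame ordered by sum_all only: rows with equal
    sum_all are peers, so each of them gets the total of all peers."""
    ordered = sorted(rows, key=lambda r: (r[1], r[0]))
    return [(gid, s, sum(x for _, x in ordered if x <= s)) for gid, s in ordered]


rows_result = running_total_rows(grouped)
range_result = running_total_range(grouped)
print(rows_result)
print(range_result)
```

With the rows frame and the id tie-breaker, b and d get distinct running totals (8 and 13). With a range frame ordered by the sum alone, both tied groups would show 13.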