How do I transpose/pivot data in Hive?

Sun*_*nny 17 hadoop hive

I know there is no direct way to transpose data in Hive. I followed this question: Is there a way to transpose data in Hive?, but since there is no final answer there, I could not get all the way.

This is the table I have:

 | ID   |   Code   |  Proc1   |   Proc2 | 
 | 1    |    A     |   p      |   e     | 
 | 2    |    B     |   q      |   f     |
 | 3    |    B     |   p      |   f     |
 | 3    |    B     |   q      |   h     |
 | 3    |    B     |   r      |   j     |
 | 3    |    C     |   t      |   k     |

Here Proc1 can have any number of values. ID, Code, and Proc1 together form the unique key of this table. I want to pivot/transpose this table so that each unique value in Proc1 becomes a new column, and the corresponding value from Proc2 becomes that column's value in the corresponding row. In essence, I'm trying to get something like:

 | ID   |   Code   |  p   |   q |  r  |   t |
 | 1    |    A     |   e  |     |     |     |
 | 2    |    B     |      |   f |     |     |
 | 3    |    B     |   f  |   h |  j  |     |
 | 3    |    C     |      |     |     |  k  |

In the newly transposed table, ID and Code together form the unique primary key. From the ticket I mentioned above, I got as far as using the to_map UDAF to achieve this. (Disclaimer - this may not be a step in the right direction, but just mentioning it here in case it is.)

 | ID   |   Code   |  Map_Aggregation   | 
 | 1    |    A     |   {p:e}            |
 | 2    |    B     |   {q:f}            |
 | 3    |    B     |   {p:f, q:h, r:j } |  
 | 3    |    C     |   {t:k}            |

But I don't know how to get from this step to the pivoted/transposed table I want. Any help on how to proceed would be great! Thanks.

小智 13

This is how I solved the problem with Hive's built-in UDF function `map`:

select
    b.id,
    b.code,
    -- each column below is an array with at most one element;
    -- concat_ws('') flattens it to a plain string ('' when empty)
    concat_ws('',b.p) as p,
    concat_ws('',b.q) as q,
    concat_ws('',b.r) as r,
    concat_ws('',b.t) as t
from 
    (
        select id, code,
        collect_list(a.group_map['p']) as p,
        collect_list(a.group_map['q']) as q,
        collect_list(a.group_map['r']) as r,
        collect_list(a.group_map['t']) as t
        from (
            select
              id,
              code,
              -- a one-entry map per row: {proc1 -> proc2}
              map(proc1,proc2) as group_map
            from 
              test_sample
        ) a
        group by
            a.id,
            a.code
    ) b;

`concat_ws` and `map` are Hive UDFs, and `collect_list` is a Hive UDAF. Because each (id, code) group contains at most one row per Proc1 value, `collect_list(a.group_map['p'])` yields an array of at most one element, which `concat_ws('', ...)` turns into a plain string (empty when the value is absent).
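If adding UDFs is not an option, the same pivot can also be sketched with plain conditional aggregation, assuming (as in the question) that the set of Proc1 values is known in advance; missing combinations come out as NULL rather than empty strings:

```sql
-- Pivot via conditional aggregation: no extra UDFs needed, but the
-- Proc1 values (p, q, r, t) must be known up front.
select
    id,
    code,
    max(case when proc1 = 'p' then proc2 end) as p,
    max(case when proc1 = 'q' then proc2 end) as q,
    max(case when proc1 = 'r' then proc2 end) as r,
    max(case when proc1 = 't' then proc2 end) as t
from test_sample
group by id, code;
```

The `max` here only serves to pick the single non-NULL value per group, since ID, Code, and Proc1 are unique together.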


myu*_*yui 6

Another solution.

Pivot using the Hivemall to_map function:

SELECT
  uid,
  kv['c1'] AS c1,
  kv['c2'] AS c2,
  kv['c3'] AS c3
FROM (
  SELECT uid, to_map(key, value) kv
  FROM vtable
  GROUP BY uid
) t

 | uid  | c1 | c2 | c3 |
 | 101  | 11 | 12 | 13 |
 | 102  | 21 | 22 | 23 |

Unpivot

SELECT t1.uid, t2.key, t2.value
FROM htable t1
LATERAL VIEW explode (map(
  'c1', c1,
  'c2', c2,
  'c3', c3
)) t2 as key, value
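Hive's built-in `stack` UDTF is an alternative sketch for the same unpivot; it takes the number of rows to emit followed by the key/value pairs:

```sql
-- Unpivot with the built-in stack() UDTF instead of explode(map(...))
SELECT t1.uid, t2.key, t2.value
FROM htable t1
LATERAL VIEW stack(3,
  'c1', c1,
  'c2', c2,
  'c3', c3
) t2 AS key, value
```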

 | uid  | key | value |
 | 101  | c1  | 11    |
 | 101  | c2  | 12    |
 | 101  | c3  | 13    |
 | 102  | c1  | 21    |
 | 102  | c2  | 22    |
 | 102  | c3  | 23    |


Sun*_*nny 5

This is the solution I ended up using:

add jar brickhouse-0.7.0-SNAPSHOT.jar;
CREATE TEMPORARY FUNCTION collect AS 'brickhouse.udf.collect.CollectUDAF';

select 
    id, 
    code,
    group_map['p'] as p,
    group_map['q'] as q,
    group_map['r'] as r,
    group_map['t'] as t
    from ( select
        id, code,
        collect(proc1,proc2) as group_map 
        from test_sample 
        group by id, code
    ) gm;

The collect UDAF used here is from the brickhouse repo: https://github.com/klout/brickhouse