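I know there is no direct way to transpose data in Hive. I followed this question: "Is there a way to transpose data in Hive?", but since it has no final answer, I could not get all the way to a solution.
Here is my table: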
| ID | Code | Proc1 | Proc2 |
| 1 | A | p | e |
| 2 | B | q | f |
| 3 | B | p | f |
| 3 | B | q | h |
| 3 | B | r | j |
| 3 | C | t | k |
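Here Proc1 can take any number of values. ID, Code, and Proc1 together form the unique key of this table. I want to pivot/transpose the table so that each unique value in Proc1 becomes a new column, and the corresponding Proc2 value becomes the value of that column in the matching row. In essence, I am trying to get something like: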
| ID | Code | p | q | r | t |
| 1 | A | e | | | |
| 2 | B | | f | | |
| 3 | B | f | h | j | |
| 3 | C | | | | k |
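In the transformed table, ID and Code together form the unique primary key. Following the question mentioned above, I got this far using the to_map UDAF. (Disclaimer: this may not be a step in the right direction, but I mention it here in case it is.)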
| ID | Code | Map_Aggregation |
| 1 | A | {p:e} |
| 2 | B | {q:f} |
| 3 | B | {p:f, q:h, r:j } |
| 3 | C | {t:k} |
But I don't know how to get from this step to the pivoted/transposed table I want. Any help on how to proceed would be great! Thanks.
小智 13
Here is how I solved this problem using Hive's built-in "map" UDF:
select
    b.id,
    b.code,
    concat_ws('', b.p) as p,
    concat_ws('', b.q) as q,
    concat_ws('', b.r) as r,
    concat_ws('', b.t) as t
from (
    select
        id,
        code,
        collect_list(a.group_map['p']) as p,
        collect_list(a.group_map['q']) as q,
        collect_list(a.group_map['r']) as r,
        collect_list(a.group_map['t']) as t
    from (
        select
            id,
            code,
            map(proc1, proc2) as group_map
        from
            test_sample
    ) a
    group by
        a.id,
        a.code
) b;
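"concat_ws" and "map" are Hive UDFs, and "collect_list" is a Hive UDAF. A key that is absent from a row's map yields NULL, collect_list skips NULLs, and concat_ws('', ...) over the resulting array gives either the single value or an empty string.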
Another solution.
Use the Hivemall to_map function to pivot.
SELECT
uid,
kv['c1'] AS c1,
kv['c2'] AS c2,
kv['c3'] AS c3
FROM (
SELECT uid, to_map(key, value) kv
FROM vtable
GROUP BY uid
) t
uid c1 c2 c3
101 11 12 13
102 21 22 23
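If Hivemall is not available, the same map can probably be assembled from built-in functions alone. The following is only a sketch, not part of the original answer: it assumes the `vtable(uid, key, value)` layout used above, that no key or value contains `,` or `:`, and a Hive version that provides `collect_list` (0.13 or later).

```sql
-- Sketch: build the key/value map with Hive built-ins instead of Hivemall's to_map.
-- Assumes vtable(uid, key, value), and that keys and values contain neither ',' nor ':'.
SELECT
    uid,
    kv['c1'] AS c1,
    kv['c2'] AS c2,
    kv['c3'] AS c3
FROM (
    SELECT
        uid,
        -- Join each row into 'key:value', glue the pairs with ',',
        -- then split the string back into a map<string,string>.
        str_to_map(
            concat_ws(',', collect_list(concat(key, ':', cast(value AS string)))),
            ',',
            ':'
        ) AS kv
    FROM vtable
    GROUP BY uid
) t;
```

Note that the values come back as strings because they pass through `str_to_map`, so cast them if numeric columns are needed.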
Unpivot
SELECT t1.uid, t2.key, t2.value
FROM htable t1
LATERAL VIEW explode (map(
'c1', c1,
'c2', c2,
'c3', c3
)) t2 as key, value
uid key value
101 c1 11
101 c2 12
101 c3 13
102 c1 21
102 c2 22
102 c3 23
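The same unpivot pattern can be applied to the pivoted table from the question to get back to the original long form. This is a sketch under assumptions: `pivoted_sample` is a hypothetical table name for the pivoted result with columns `id`, `code`, `p`, `q`, `r`, `t`, all of the value columns being strings.

```sql
-- Sketch: unpivot the question's pivoted table back to (id, code, proc1, proc2).
-- pivoted_sample is a hypothetical name for the pivoted result.
SELECT t1.id, t1.code, t2.proc1, t2.proc2
FROM pivoted_sample t1
LATERAL VIEW explode(map(
    'p', p,
    'q', q,
    'r', r,
    't', t
)) t2 AS proc1, proc2
-- Drop the cells that were empty in the pivoted table.
WHERE t2.proc2 IS NOT NULL
  AND t2.proc2 <> '';
```

The filter covers both representations of a missing cell: NULL (from a map lookup) and the empty string (from concat_ws).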
Here is the solution I ended up using:
add jar brickhouse-0.7.0-SNAPSHOT.jar;
CREATE TEMPORARY FUNCTION collect AS 'brickhouse.udf.collect.CollectUDAF';
select
id,
code,
group_map['p'] as p,
group_map['q'] as q,
group_map['r'] as r,
group_map['t'] as t
from ( select
id, code,
collect(proc1,proc2) as group_map
from test_sample
group by id, code
) gm;
The to_map-style UDAF (registered above as collect) comes from the Brickhouse repo: https://github.com/klout/brickhouse
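All of the answers above build a map and then read fixed keys from it, so the list of Proc1 values has to be known when the query is written. Under that same assumption, the pivot can also be sketched with plain conditional aggregation and no extra jar. This is a different technique from the ones above; it assumes the `test_sample` table and the four values p, q, r, t from the question, and it returns NULL rather than an empty string for missing cells.

```sql
-- Sketch: pivot via conditional aggregation (not the map-based approach above).
-- Assumes test_sample(id, code, proc1, proc2) and that the Proc1 values
-- p, q, r, t are known in advance.
SELECT
    id,
    code,
    -- CASE yields NULL for non-matching rows and max() ignores NULLs,
    -- so each column keeps the proc2 value for that (id, code, proc1).
    max(CASE WHEN proc1 = 'p' THEN proc2 END) AS p,
    max(CASE WHEN proc1 = 'q' THEN proc2 END) AS q,
    max(CASE WHEN proc1 = 'r' THEN proc2 END) AS r,
    max(CASE WHEN proc1 = 't' THEN proc2 END) AS t
FROM test_sample
GROUP BY id, code;
```

Because ID, Code, and Proc1 together are unique, each max() sees at most one non-NULL value per group. Supporting a new Proc1 value means adding one more column expression.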