Storing data efficiently in Hive

Question

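How can I store data efficiently in Hive, and how do I store and retrieve compressed data in Hive? Currently I store it as a TextFile. I was reading Bejoy's article and found that LZO compression is useful for storing files, and that it is also splittable.

I have a HiveQL SELECT query that generates some output, and I store that output somewhere so that one of my Hive tables (quality) can use the data and I can then query the quality table.

Below is the quality table, followed by the SELECT query I use to load data into it by overwriting a partition of the table.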

create table quality
(id bigint,
  total bigint,
  error bigint
 )
partitioned by (ds string)
row format delimited fields terminated by '\t'
stored as textfile
location '/user/uname/quality'
;

insert overwrite table quality partition (ds='20120709')
SELECT id  , count2 , coalesce(error, cast(0 AS BIGINT)) AS count1  FROM Table1;

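So currently I am storing it as a TextFile. Should I switch to a SequenceFile and start storing the data in LZO compression format? Or is a text file fine here too? The SELECT query will produce a few GB of data, which needs to be loaded into the quality table every day.

So which way is best? Should I store the output as a TextFile or as a SequenceFile (LZO compressed), so that queries against the Hive quality table return results faster?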

Update:

What if I store it as a SequenceFile with block compression, like below?

set mapred.output.compress=true;
set mapred.output.compression.type=BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec;

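Do I need to set anything else besides the above to enable BLOCK compression? I am also creating the table in SequenceFile format.

Update again

Should I create the table as shown below? Or do I need to make some other changes to enable BLOCK compression with a SequenceFile?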

create table lipy
( buyer_id bigint,
  total_chkout bigint,
  total_errpds bigint
 )
 partitioned by (dt string)
row format delimited fields terminated by '\t'
stored as sequencefile
location '/apps/hdmi-technology/lipy'
;

Answer 1

ale*_*pab 1

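I have not used Hive much, but in my experience with Hadoop and structured data, I got the best performance from SequenceFiles with BLOCK compression. The default is row (RECORD) compression, but when you store structured data and the rows are not particularly large, it is less efficient than BLOCK compression. To turn it on, I used mapred.output.compression.type=BLOCK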
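
For reference, here is a minimal sketch of how those settings might be combined in a Hive session before loading the SequenceFile table from the question. It rests on a few assumptions that are not in the original posts: that Hive's hive.exec.compress.output switch is what enables compressed query output on your version, that the hadoop-lzo library is installed and exposes the codec as com.hadoop.compression.lzo.LzoCodec (the class name in the question may not exist on newer Hadoop builds), and that the partition value is only an example.

```sql
-- Assumption: hadoop-lzo is installed on the cluster and provides
-- com.hadoop.compression.lzo.LzoCodec. Swap in another codec if it is not.

-- Ask Hive to compress the files written by the query.
SET hive.exec.compress.output=true;

-- For SequenceFile output, compress blocks of records instead of single records.
SET mapred.output.compression.type=BLOCK;

-- Codec used for the compressed output (assumed class name, see above).
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;

-- Table and column names come from the question; the dt value is an example.
INSERT OVERWRITE TABLE lipy PARTITION (dt='20120709')
SELECT id, count2, COALESCE(error, CAST(0 AS BIGINT))
FROM Table1;
```

The compression type only matters for SequenceFile output, so the target table should be declared STORED AS SEQUENCEFILE, as in the lipy definition in the question.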
