ryn*_*nop 2 hive lzo amazon-web-services elastic-map-reduce emr
I export my DynamoDB table to S3 as a backup (via EMR). When I export, I store the data as LZO-compressed files. My Hive query is below, but essentially I followed "Export an Amazon DynamoDB table to an Amazon S3 bucket using data compression" at http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/EMR_Hive_Commands.html.
I now want to do the reverse: take my LZO files and get them back into a Hive table. How do you do that? I expected to see some Hive configuration property for the input side, but there isn't one. I've Googled and found some hints, but nothing definitive and nothing that works.
The files in S3 are of the form: s3://[mybucket]/backup/year=2012/month=08/day=01/000000.lzo
Here is my export HQL:
SET dynamodb.throughput.read.percent=1.0;
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
CREATE EXTERNAL TABLE hiveSBackup (id bigint, periodStart string, allotted bigint, remaining bigint, created string, seconds bigint, served bigint, modified string)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "${DYNAMOTABLENAME}",
"dynamodb.column.mapping" = "id:id,periodStart:periodStart,allotted:allotted,remaining:remaining,created:created,seconds:seconds,served:served,modified:modified");
CREATE EXTERNAL TABLE s3_export (id bigint, periodStart string, allotted bigint, remaining bigint, created string, seconds bigint, served bigint, modified string)
PARTITIONED BY (year string, month string, day string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://<mybucket>/backup';
INSERT OVERWRITE TABLE s3_export
PARTITION (year="${PARTITIONYEAR}", month="${PARTITIONMONTH}", day="${PARTITIONDAY}")
SELECT * from hiveSBackup;
Any ideas how to get this data from S3, decompressed, into a Hive table?
Hive on EMR can read data directly from S3; you don't need to import anything. You just create an external table and tell it where the data lives. It also has LZO support built in: if the files end with the .lzo extension, Hive will decompress them automatically.
So to "import" LZO data from S3 into Hive, just create an external table pointing at the LZO-compressed data in S3, and Hive will decompress it whenever you run a query against it. That's pretty much what you did when you "exported" the data; you can read from that s3_export table too.
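For the partitioned backup layout shown in the question, that could look roughly like this (a sketch: the table name `s3_import` is made up, the columns are copied from the question's export table, and the ADD PARTITION step assumes the year=/month=/day= directory structure shown above):

```
-- External table over the existing LZO backup; nothing is copied.
CREATE EXTERNAL TABLE s3_import (id bigint, periodStart string, allotted bigint,
  remaining bigint, created string, seconds bigint, served bigint, modified string)
PARTITIONED BY (year string, month string, day string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://<mybucket>/backup';

-- Register the partitions that exist in S3 before querying them.
ALTER TABLE s3_import ADD PARTITION (year='2012', month='08', day='01');

-- Hive decompresses the .lzo files transparently at query time.
SELECT * FROM s3_import WHERE year='2012' AND month='08' AND day='01' LIMIT 10;
```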
If you want the data in a non-external table, just INSERT OVERWRITE into a new table, selecting from the external one.
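A minimal sketch of that step, assuming a managed table named `local_copy` (hypothetical name) and the `s3_export` table from the question:

```
-- Managed (non-external) table; the data is copied into Hive's warehouse.
CREATE TABLE local_copy (id bigint, periodStart string, allotted bigint,
  remaining bigint, created string, seconds bigint, served bigint, modified string);

INSERT OVERWRITE TABLE local_copy
SELECT id, periodStart, allotted, remaining, created, seconds, served, modified
FROM s3_export;
```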
Unless I've misunderstood your question and you meant to ask about importing into DynamoDB, not just into a Hive table?
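If the goal really is to restore into DynamoDB, the same pattern works in reverse through the DynamoDB-backed `hiveSBackup` table defined in the question (a sketch; the write-throughput setting is an assumption and should be tuned for your table):

```
-- Throttle writes against the DynamoDB table's provisioned capacity.
SET dynamodb.throughput.write.percent=1.0;

-- Push rows from the S3-backed external table into DynamoDB
-- via the storage handler table from the question.
INSERT OVERWRITE TABLE hiveSBackup
SELECT id, periodStart, allotted, remaining, created, seconds, served, modified
FROM s3_export;
```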
This is what I've been doing:
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
CREATE EXTERNAL TABLE users
(id int, username string, firstname string, surname string, email string, birth_date string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 's3://bucket/someusers';
INSERT OVERWRITE TABLE users
SELECT * FROM someothertable;
I end up with a bunch of files under s3://bucket/someusers with the .lzo extension, all readable by Hive.
You only need to set the codec when you want to write compressed data; on the read side, the compression is detected automatically.
Viewed: 2898 times