java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow

Gre*_*umb 3 hive hiveql

I am trying to get compression working.

The original table is defined as:

create external table orig_table (col1 String ...... coln String) 
.
.
.
partitioned by (pdate string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ( "separatorChar" = "|")
STORED AS TEXTFILE location '/user/path/to/table/';

The table orig_table has about 10 partitions, with around 100 rows each.

To compress it, I created a similar table, the only modification being ORCFILE instead of TEXTFILE:

create external table orig_table_orc (col1 String ...... coln String) 
.
.
.
partitioned by (pdate string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ( "separatorChar" = "|")
STORED AS ORCFILE location '/user/path/to/table/';

Then I tried to copy the records over with:

set hive.exec.dynamic.partition.mode=nonstrict;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec;
-- have tried with other codecs as well, with the same error
set mapred.output.compression.type=RECORD;
insert overwrite table zip_test.orig_table_orc partition(pdate) select * from default.orig_table;

The error I get is:

Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"col1":value ... "coln":value}
        at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:503)
        at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:176)
        ... 8 more
Caused by: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow
        at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.write(OrcOutputFormat.java:81)
        at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:689)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
        at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
        at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:95)
        at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:157)
        at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:493)
        ... 9 more

Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143


FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3   HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec

The same thing works if I make the Hive table a SEQUENCEFILE instead of ORC. Is there any workaround? I have seen a few questions with the same error, but in Java programs rather than Hive QL.

Sam*_*ter 5

Argh! ORC is a completely different beast from CSV!!!

Explaining everything you did wrong would take hours, plus quite a few excerpts from books about Hadoop and database technology in general, so the short answer is: ROW FORMAT and SERDE make no sense for that format. STORED AS ORC already implies the ORC SerDe, and overriding it with OpenCSVSerde is exactly why the ORC writer receives Text objects instead of OrcSerdeRow, hence the ClassCastException. Also, since you populate that table from inside Hive, IMHO it should not be EXTERNAL but a "managed" table:

create table orig_table_orc
 (col1 String ...... coln String) 
partitioned by (pdate string)
stored as Orc
location '/where/ever/you/want'
TblProperties ("orc.compress"="ZLIB")
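
With the SerDe clause gone, the INSERT from the question should work as-is. A minimal sketch, reusing the table names from the question; note that compression now comes from the orc.compress table property, so none of the mapred.output.compress* settings are needed:

-- dynamic partitioning is still required, since pdate is derived from the SELECT
set hive.exec.dynamic.partition.mode=nonstrict;

-- ZLIB compression is picked up from TBLPROPERTIES on the target table
insert overwrite table orig_table_orc partition(pdate)
select * from orig_table;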