Using JSON-SerDe in a Hive table

pra*_*mav 6 hadoop hive

I am trying to use the JSON-SerDe from the following link: http://code.google.com/p/hive-json-serde/wiki/GettingStarted

         CREATE TABLE my_table (field1 string, field2 int, 
                                     field3 string, field4 double)
         ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde' ;

I added the JSON-SerDe jar with

          ADD JAR /path-to/hive-json-serde.jar;

and loaded the data with

LOAD DATA LOCAL INPATH  '/home/hduser/pradi/Test.json' INTO TABLE my_table;

The data loaded successfully.

But when I query the data with

select * from my_table;

I get only one row from the table:

data1   100     more data1      123.001

Test.json contains

{"field1":"data1","field2":100,"field3":"more data1","field4":123.001} 

{"field1":"data2","field2":200,"field3":"more data2","field4":123.002} 

{"field1":"data3","field2":300,"field3":"more data3","field4":123.003} 

{"field1":"data4","field2":400,"field3":"more data4","field4":123.004}

Where is the problem? Why does querying the table return only one row instead of four, when /user/hive/warehouse/my_table contains all 4 rows?


hive> add jar /home/hduser/pradeep/hive-json-serde-0.2.jar;
Added /home/hduser/pradeep/hive-json-serde-0.2.jar to class path
Added resource: /home/hduser/pradeep/hive-json-serde-0.2.jar

hive> CREATE EXTERNAL TABLE my_table (field1 string, field2 int,
>                                 field3 string, field4 double)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
> WITH SERDEPROPERTIES (
>   "field1"="$.field1",
>   "field2"="$.field2",
>   "field3"="$.field3",
>   "field4"="$.field4"
> );
OK
Time taken: 0.088 seconds

hive> LOAD DATA LOCAL INPATH  '/home/hduser/pradi/test.json' INTO TABLE my_table;
Copying data from file:/home/hduser/pradi/test.json
Copying file: file:/home/hduser/pradi/test.json
Loading data to table default.my_table
OK
Time taken: 0.426 seconds

hive> select * from my_table;
OK
data1   100     more data1      123.001
Time taken: 0.17 seconds

I have posted the contents of the test.json file above, so you can see that the query returns only one row:

data1   100     more data1      123.001

I changed the JSON file to employee.json, which contains

{"firstName":"Mike","lastName":"Chepesky","employeeNumber":1840192}

and changed the table accordingly, but when I query the table it shows NULL values:

hive> add jar /home/hduser/pradi/hive-json-serde-0.2.jar;
Added /home/hduser/pradi/hive-json-serde-0.2.jar to class path
Added resource: /home/hduser/pradi/hive-json-serde-0.2.jar

hive> create EXTERNAL table employees_json (firstName string, lastName string, employeeNumber int)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde';
OK
Time taken: 0.297 seconds


hive> load data local inpath '/home/hduser/pradi/employees.json' into table employees_json;
Copying data from file:/home/hduser/pradi/employees.json
Copying file: file:/home/hduser/pradi/employees.json
Loading data to table default.employees_json
OK
Time taken: 0.293 seconds


hive> select * from employees_json;
OK
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
Time taken: 0.194 seconds

Mic*_*las 3

Without the logs (see Getting Started) it is hard to tell what is going on. Just a quick thought: can you try whether it works with WITH SERDEPROPERTIES, like this:

CREATE EXTERNAL TABLE my_table (field1 string, field2 int, 
                                field3 string, field4 double)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
WITH SERDEPROPERTIES (
  "field1"="$.field1",
  "field2"="$.field2",
  "field3"="$.field3",
  "field4"="$.field4" 
);

You may also want to try a fork from ThinkBigAnalytics.

Update: It turns out that the input in Test.json was invalid JSON, so the records collapsed.

For more details, see the answer /sf/answers/819559541/.
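
Since the update above puts the blame on invalid JSON in the input file, it can help to check the file before loading it into Hive. The following is only a minimal sketch in Python, not part of the SerDe. It assumes the table expects one complete JSON object per line; the function name `find_invalid_lines` is made up for illustration. Blank lines are flagged only so that they are visible. Whether they actually upset this particular SerDe is not established here.

```python
import json
import sys


def find_invalid_lines(lines):
    """Return (line_number, reason) for every line that is not a standalone JSON object.

    Assumption: the Hive table expects exactly one JSON object per line.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            # Flagged for visibility only; see the assumption in the lead-in.
            problems.append((number, "blank line"))
            continue
        try:
            record = json.loads(text)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
    return problems


if __name__ == "__main__" and len(sys.argv) > 1:
    # The file path is passed on the command line, e.g. the Test.json from the question.
    with open(sys.argv[1], encoding="utf-8") as handle:
        for number, reason in find_invalid_lines(handle):
            print(f"line {number}: {reason}")
```

Saved under a name of your choice (say `check_json_lines.py`, hypothetical) and run as `python check_json_lines.py /home/hduser/pradi/Test.json`, it prints one line per problem and nothing if every line parses as a JSON object. Curly "smart" quotes, which sometimes creep in through copy and paste, show up as invalid JSON.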