I am trying to use the JSON SerDe from http://code.google.com/p/hive-json-serde/wiki/GettingStarted with the following table:
         CREATE TABLE my_table (field1 string, field2 int, 
                                     field3 string, field4 double)
         ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde' ;
I added the JSON SerDe jar with:
          ADD JAR /path-to/hive-json-serde.jar;
and loaded the data with:
LOAD DATA LOCAL INPATH  '/home/hduser/pradi/Test.json' INTO TABLE my_table;
The data loaded successfully.
But when I query the data with
select * from my_table;
I get only one row from the table:
data1   100     more data1      123.001
Test.json contains:
{"field1":"data1","field2":100,"field3":"more data1","field4":123.001} 
{"field1":"data2","field2":200,"field3":"more data2","field4":123.002} 
{"field1":"data3","field2":300,"field3":"more data3","field4":123.003} 
{"field1":"data4","field2":400,"field3":"more data4","field4":123.004}
What is going wrong? Why does querying the table return only one row instead of four, even though the file in /user/hive/warehouse/my_table contains all four rows?
hive> add jar /home/hduser/pradeep/hive-json-serde-0.2.jar;
Added /home/hduser/pradeep/hive-json-serde-0.2.jar to class path
Added resource: /home/hduser/pradeep/hive-json-serde-0.2.jar
hive> CREATE EXTERNAL TABLE my_table (field1 string, field2 int,
>                                 field3 string, field4 double)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
> WITH SERDEPROPERTIES (
>   "field1"="$.field1",
>   "field2"="$.field2",
>   "field3"="$.field3",
>   "field4"="$.field4"
> );
OK
Time taken: 0.088 seconds
hive> LOAD DATA LOCAL INPATH  '/home/hduser/pradi/test.json' INTO TABLE my_table;
Copying data from file:/home/hduser/pradi/test.json
Copying file: file:/home/hduser/pradi/test.json
Loading data to table default.my_table
OK
Time taken: 0.426 seconds
hive> select * from my_table;
OK
data1   100     more data1      123.001
Time taken: 0.17 seconds
I have already posted the contents of the test.json file above, so you can see that the query returns only one row:
data1   100     more data1      123.001
I changed the JSON file to employee.json, which contains:
{"firstName":"Mike","lastName":"Chepesky","employeeNumber":1840192}
and changed the table to match, but when I query the table it shows NULL values:
hive> add jar /home/hduser/pradi/hive-json-serde-0.2.jar;
Added /home/hduser/pradi/hive-json-serde-0.2.jar to class path
Added resource: /home/hduser/pradi/hive-json-serde-0.2.jar
hive> create EXTERNAL table employees_json (firstName string, lastName string,        employeeNumber int )
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde';
OK
Time taken: 0.297 seconds
hive> load data local inpath '/home/hduser/pradi/employees.json' into table     employees_json;
Copying data from file:/home/hduser/pradi/employees.json
Copying file: file:/home/hduser/pradi/employees.json
Loading data to table default.employees_json
OK
Time taken: 0.293 seconds
 hive>select * from employees_json;
  OK
  NULL    NULL    NULL
  NULL    NULL    NULL
  NULL    NULL    NULL
  NULL    NULL    NULL
  NULL    NULL    NULL
  NULL    NULL    NULL
Time taken: 0.194 seconds
It is hard to tell what is going on without the logs (see Getting Started). Just a quick thought: can you try whether it works WITH SERDEPROPERTIES, like this:
CREATE EXTERNAL TABLE my_table (field1 string, field2 int, 
                                field3 string, field4 double)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
WITH SERDEPROPERTIES (
  "field1"="$.field1",
  "field2"="$.field2",
  "field3"="$.field3",
  "field4"="$.field4" 
);
You may also want to try the fork from ThinkBigAnalytics.
Update: it turns out the input in Test.json was invalid JSON, so the records collapsed.
For more details, see the answer at /sf/answers/819559541/ .
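Since the diagnosis was invalid JSON in the input file, it may help to check the file before loading it into Hive. The following is only a minimal sketch in Python, assuming the file holds one JSON object per line as in the Test.json listing in the question. The function name find_invalid_json_lines is made up for illustration, and it tests general JSON validity with the standard json module, which is not necessarily identical to what the SerDe accepts.

```python
import json


def find_invalid_json_lines(path):
    """Return (line_number, message) for each non-blank line that is not a JSON object.

    Hypothetical helper: it checks general JSON validity only, not the
    exact rules applied by the Hive JSON SerDe.
    """
    problems = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            text = line.strip()
            if not text:
                continue  # blank lines are skipped, not reported
            try:
                record = json.loads(text)
            except json.JSONDecodeError as error:
                # error.msg is the parser's short description of the failure
                problems.append((line_number, error.msg))
                continue
            if not isinstance(record, dict):
                # valid JSON, but not an object with named fields
                problems.append((line_number, "not a JSON object"))
    return problems
```

Calling find_invalid_json_lines("/home/hduser/pradi/Test.json") returns an empty list when every non-blank line parses as a JSON object, and otherwise lists the line numbers worth inspecting by hand.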