Hive XML SerDe: table is empty

ash*_*ini 2 hive

I want to store XML data in a Hive table. The XML data looks like this:

<servicestatuslist>
   <recordcount>1266</recordcount> 
     <servicestatus id="435680">
     <status_text>/: 61%used(9714MB/15975MB) (<80%) : OK</status_text> 
     <display_name>/ Disk Usage</display_name> 
     <host_name>zabbix.vshodc.com</host_name> 
     </servicestatus>
</servicestatuslist>

I have added the JAR file to the class path:

hive> add jar /home/cloudera/HiveJars/hivexmlserde-1.0.5.1.jar ;    
Added /home/cloudera/HiveJars/hivexmlserde-1.0.5.1.jar to class path
Added resource: /home/cloudera/HiveJars/hivexmlserde-1.0.5.1.jar

I then wrote a CREATE TABLE statement that uses the XML SerDe:

 create table xml_AIR(id STRING, status_text STRING,display_name STRING ,host_name STRING)
    row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
    with serdeproperties(
    "column.xpath.id"="/servicestatus/@id",
    "column.xpath.status_text"="/servicestatus/status_text/text()",
    "column.xpath.display_name"="/servicestatus/display_name/text()",
    "column.xpath.host_name"="/servicestatus/host_name/text()"
    )
    stored as
    inputformat 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
    outputformat 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
    LOCATION  '/user/cloudera/input/air.xml'
    tblproperties(
    "xmlinput.start"="<servicestatus",
    "xmlinput.end"="</servicestatus>"
    );
    OK
    Time taken: 1.609 seconds

When I run a SELECT, it returns no rows from the table:

hive> select * from xml_AIR;       
OK
Time taken: 3.0 seconds

What is wrong with the code above? Please help.

Mah*_*esh 5

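I ran into the same problem while working with the XML SerDe. After some effort, I fixed it by loading the data with a separate LOAD DATA statement and leaving the LOCATION clause out of the CREATE statement. Here is my XML data: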

<record customer_id="0000-JTALA">
        <income>200000</income>     
        <demographics>
            <gender>F</gender>
            <agecat>1</agecat>
            <edcat>1</edcat>
            <jobcat>2</jobcat>
            <empcat>2</empcat>
            <retire>0</retire>
            <jobsat>1</jobsat>
            <marital>1</marital>
            <spousedcat>1</spousedcat>
            <residecat>4</residecat>
            <homeown>0</homeown>
            <hometype>2</hometype>
            <addresscat>2</addresscat>
        </demographics>
        <financial>
            <income>18</income>
            <creddebt>1.003392</creddebt>
            <othdebt>2.740608</othdebt>
            <default>0</default>
        </financial>
    </record>

CREATE TABLE statement:

CREATE TABLE xml_bank6(customer_id STRING, income BIGINT, demographics map<string,string>, financial map<string,string>)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.customer_id"="/record/@customer_id",
"column.xpath.income"="/record/income/text()",
"column.xpath.demographics"="/record/demographics/*",
"column.xpath.financial"="/record/financial/*"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES (
"xmlinput.start"="<record customer",
"xmlinput.end"="</record>"
);

Result of the CREATE statement:

OK
Time taken: 0.925 seconds
hive>

After running the CREATE statement above, I used the following LOAD DATA statement to load the contents of the XML file into the table:

hive> load data local inpath '/home/mahesh/hive_input_datasets/XMLdata/XMLdatafile.xml' overwrite into table xml_bank6;

Result of the LOAD statement:

Copying data from file:/home/mahesh/hive_input_datasets/XMLdata/XMLdatafile.xml
Copying file: file:/home/mahesh/hive_input_datasets/XMLdata/XMLdatafile.xml
Loading data to table default.xml_bank6
Table default.xml_bank6 stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 500, raw_data_size: 0]
OK
Time taken: 0.879 seconds
hive>

Finally, the SELECT query and its result:

hive> select * from xml_bank6;
OK
0000-JTALA  200000  {"empcat":"2","jobcat":"2","residecat":"4","retire":"0","hometype":"2","addresscat":"2","homeown":"0","spousedcat":"1","gender":"F","jobsat":"1","edcat":"1","marital":"1","agecat":"1"}    {"default":"0","income":"18","othdebt":"2.740608","creddebt":"1.003392"}
Time taken: 0.149 seconds, Fetched: 1 row(s)
hive>

For the query in your question, I would suggest setting "xmlinput.start" to "<servicestatus id" rather than "<servicestatus", because the XML start tag has the form <servicestatus id="some data">. I hope this helps.
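
To illustrate why the longer start marker can matter, here is a minimal Python sketch of marker-based record splitting. It is a simplified model written for illustration, not the actual XmlInputFormat code: the helper names split_records and is_well_formed are hypothetical, the sample is a trimmed copy of the XML from the question, and it assumes the start and end markers are treated as plain substrings. Under that assumption, "<servicestatus" is also a prefix of the root tag <servicestatuslist>, so the first extracted record would begin at the root element instead of at a <servicestatus> element.

```python
import xml.etree.ElementTree as ET


def split_records(text, start, end):
    """Hypothetical helper: cut text into records delimited by plain substring markers."""
    records = []
    pos = 0
    while True:
        begin = text.find(start, pos)
        if begin == -1:
            break
        stop = text.find(end, begin)
        if stop == -1:
            break
        stop += len(end)
        records.append(text[begin:stop])
        pos = stop
    return records


def is_well_formed(fragment):
    """Return True if the fragment parses as a standalone XML element."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# Trimmed copy of the XML from the question (status_text omitted for brevity).
SAMPLE = (
    "<servicestatuslist>"
    "<recordcount>1266</recordcount>"
    '<servicestatus id="435680">'
    "<display_name>/ Disk Usage</display_name>"
    "<host_name>zabbix.vshodc.com</host_name>"
    "</servicestatus>"
    "</servicestatuslist>"
)

# Short marker: also matches the root tag <servicestatuslist>.
loose = split_records(SAMPLE, "<servicestatus", "</servicestatus>")
# Longer marker: only matches the record element itself.
strict = split_records(SAMPLE, "<servicestatus id", "</servicestatus>")

print("loose record starts with: ", loose[0][:19])
print("loose record well-formed: ", is_well_formed(loose[0]))
print("strict record well-formed:", is_well_formed(strict[0]))
```

In this model the short marker yields a fragment whose root element is never closed, while the longer marker yields a complete <servicestatus> element that an XPath such as /servicestatus/@id can be evaluated against.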