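Help needed!
I am using Flume to stream a Twitter feed into HDFS and then loading it into Hive for analysis.
The steps are as follows:
Data in HDFS:
I have described the Avro schema in an .avsc file and put it into Hadoop: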
{"type":"record",
"name":"Doc",
"doc":"adoc",
"fields":[{"name":"id","type":"string"},
{"name":"user_friends_count","type":["int","null"]},
{"name":"user_location","type":["string","null"]},
{"name":"user_description","type":["string","null"]},
{"name":"user_statuses_count","type":["int","null"]},
{"name":"user_followers_count","type":["int","null"]},
{"name":"user_name","type":["string","null"]},
{"name":"user_screen_name","type":["string","null"]},
{"name":"created_at","type":["string","null"]},
{"name":"text","type":["string","null"]},
{"name":"retweet_count","type":["boolean","null"]},
{"name":"retweeted","type":["boolean","null"]},
{"name":"in_reply_to_user_id","type":["long","null"]},
{"name":"source","type":["string","null"]},
{"name":"in_reply_to_status_id","type":["long","null"]},
{"name":"media_url_https","type":["string","null"]},
{"name":"expanded_url","type":["string","null"]}]}
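One way to sanity-check a schema like the one above before pointing Hive at it is to list the column types it should produce and compare them with the `desc` output. The following is only a minimal sketch: it uses a trimmed copy of the schema (a few fields, for brevity), the Avro-to-Hive mapping table is my understanding of how the AvroSerDe maps primitives, and the helper names `hive_columns` and `suspicious_counts` are hypothetical, made up for this illustration.

```python
import json

# Trimmed copy of the schema from the post (only a few fields, for brevity).
SCHEMA_JSON = """
{"type":"record","name":"Doc","doc":"adoc",
 "fields":[{"name":"id","type":"string"},
           {"name":"user_friends_count","type":["int","null"]},
           {"name":"retweet_count","type":["boolean","null"]},
           {"name":"retweeted","type":["boolean","null"]},
           {"name":"in_reply_to_user_id","type":["long","null"]}]}
"""

# Assumed mapping from Avro primitive types to Hive column types.
AVRO_TO_HIVE = {
    "string": "string", "int": "int", "long": "bigint",
    "boolean": "boolean", "float": "float", "double": "double",
    "bytes": "binary",
}


def hive_columns(schema_text):
    """Return (column name, Hive type) pairs for a flat Avro record schema."""
    schema = json.loads(schema_text)
    columns = []
    for field in schema["fields"]:
        avro_type = field["type"]
        # A union such as ["int", "null"] is a nullable column: take the non-null branch.
        if isinstance(avro_type, list):
            avro_type = [t for t in avro_type if t != "null"][0]
        columns.append((field["name"], AVRO_TO_HIVE[avro_type]))
    return columns


def suspicious_counts(columns):
    """Names of '*_count' columns whose type is not an integer type."""
    return [name for name, hive_type in columns
            if name.endswith("_count") and hive_type not in ("int", "bigint")]


columns = hive_columns(SCHEMA_JSON)
for name, hive_type in columns:
    print(name, hive_type)
print("worth double-checking:", suspicious_counts(columns))
```

In this schema the check flags `retweet_count`, which is declared as a boolean even though the name suggests a number. Whether that is related to the error is not something this sketch can tell; it only points at a field worth comparing against what the Flume source actually writes.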
I wrote an .hql file to create the table and load the data into it:
create table tweetsavro
row format serde
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
stored as inputformat
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
outputformat
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
tblproperties ('avro.schema.url'='hdfs:///avro_schema/AvroSchemaFile.avsc');
load data inpath '/test/twitter_data/FlumeData.*' overwrite into table tweetsavro;
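Since `load data inpath` only moves files into the table directory without checking their contents, it may be worth confirming that the Flume output really is in Avro container format. A rough sketch of such a check, assuming one of the files has first been copied to the local disk (for example with `hdfs dfs -get`): Avro object container files begin with the four bytes `Obj` followed by the byte 1. The function name `looks_like_avro_container` and the demo file names are hypothetical.

```python
import os
import tempfile

# Avro object container files start with these four magic bytes.
AVRO_MAGIC = b"Obj\x01"


def looks_like_avro_container(path):
    """Return True if the file at `path` starts with the Avro container magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(len(AVRO_MAGIC)) == AVRO_MAGIC


def demo():
    """Build two throwaway files and run the check on both."""
    with tempfile.TemporaryDirectory() as workdir:
        avro_like = os.path.join(workdir, "FlumeData.avro-like")
        plain_text = os.path.join(workdir, "FlumeData.text")
        with open(avro_like, "wb") as handle:
            handle.write(AVRO_MAGIC + b"rest of the file")
        with open(plain_text, "wb") as handle:
            handle.write(b'{"id": "1"}\n')
        return looks_like_avro_container(avro_like), looks_like_avro_container(plain_text)


print(demo())
```

A file that fails this check was written in some other format (plain text or a sequence file, for instance), and the Avro SerDe would not be able to read it regardless of the schema. A file that passes could still have been written with a different schema than the one in the .avsc file.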
The .hql file ran successfully, but when I run select * from <tablename> in Hive, it shows the following error:
The output for tweetsavro is:
hive> desc tweetsavro;
OK
id string
user_friends_count int
user_location string
user_description string
user_statuses_count int …