I have many JSON arrays stored in a table (jt), like this:
[{"ts":1403781896,"id":14,"log":"show"},{"ts":1403781896,"id":14,"log":"start"}]
[{"ts":1403781911,"id":14,"log":"press"},{"ts":1403781911,"id":14,"log":"press"}]
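Each array is one record.
I want to parse this table into a new table (logs) with three fields: ts, id, log. I tried the get_json_object method, but it doesn't seem to work with JSON arrays, because I only get null values.
Here is the code I tested: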
CREATE TABLE logs AS
SELECT get_json_object(jt.value, '$.ts') AS ts,
get_json_object(jt.value, '$.id') AS id,
get_json_object(jt.value, '$.log') AS log
FROM jt;
I tried other functions, but they look complicated. Thanks! :)
Update! I solved my problem with a regexp:
CREATE TABLE jt_reg AS
select regexp_replace(regexp_replace(value,'\\}\\,\\{','\\}\\\n\\{'),'\\[|\\]','') as valuereg from jt;
CREATE TABLE logs AS
SELECT get_json_object(jt_reg.valuereg, '$.ts') AS ts,
get_json_object(jt_reg.valuereg, '$.id') AS id,
get_json_object(jt_reg.valuereg, '$.log') AS log
FROM jt_reg;
Use the explode() function:
hive (default)> CREATE TABLE logs AS
> SELECT get_json_object(single_json_table.single_json, '$.ts') AS ts,
> get_json_object(single_json_table.single_json, '$.id') AS id,
> get_json_object(single_json_table.single_json, '$.log') AS log
> FROM
> (SELECT explode(json_array_col) as single_json FROM jt) single_json_table ;
Automatically selecting local only mode for query
Total MapReduce jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
hive (default)> select * from logs;
OK
ts id log
1403781896 14 show
1403781896 14 start
1403781911 14 press
1403781911 14 press
Time taken: 0.118 seconds, Fetched: 4 row(s)
hive (default)>
where json_array_col is the column in jt that holds the array of JSON strings:
hive (default)> select json_array_col from jt;
json_array_col
["{"ts":1403781896,"id":14,"log":"show"}","{"ts":1403781896,"id":14,"log":"start"}"]
["{"ts":1403781911,"id":14,"log":"press"}","{"ts":1403781911,"id":14,"log":"press"}"]
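For readers who want to check the logic outside Hive, here is a minimal Python sketch of what the query above computes: explode() turns each array element into its own row, and the three fields are then read from each JSON object. This is an emulation, not Hive code; the names `rows`, `explode`, and `to_log_row` are made up for illustration. One difference to keep in mind: get_json_object returns strings, while `json.loads` keeps the numbers as integers.

```python
import json

# Two rows of a hypothetical array<string> column, mirroring json_array_col above.
rows = [
    ['{"ts":1403781896,"id":14,"log":"show"}',
     '{"ts":1403781896,"id":14,"log":"start"}'],
    ['{"ts":1403781911,"id":14,"log":"press"}',
     '{"ts":1403781911,"id":14,"log":"press"}'],
]


def explode(array_rows):
    """Yield one element per output row, like explode() applied to an array column."""
    for array in array_rows:
        for element in array:
            yield element


def to_log_row(single_json):
    """Read the ts, id and log fields from one JSON object string."""
    obj = json.loads(single_json)
    return obj["ts"], obj["id"], obj["log"]


logs = [to_log_row(s) for s in explode(rows)]
for row in logs:
    print(*row)
```

Two input rows with two elements each give four output rows, matching the four rows fetched from logs above.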
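I just ran into this problem, with the JSON arrays stored as strings in a Hive table.
The solution is a bit hacky and ugly, but it works and doesn't require SerDes or external UDFs: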
SELECT
get_json_object(single_json_table.single_json, '$.ts') AS ts,
get_json_object(single_json_table.single_json, '$.id') AS id,
get_json_object(single_json_table.single_json, '$.log') AS log
FROM ( SELECT explode (
split(regexp_replace(substr(json_array_col, 2, length(json_array_col)-2),
'"}","', '"}",,,,"'), ',,,,')
) AS single_json FROM src_table) single_json_table;
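I broke the expression across lines so it's easier to read. substr() strips the first and last characters, removing the [ and ]. Then regexp_replace matches the separator between records in the JSON array and adds or changes the delimiter to something unique, which split() can then use to turn the string into a Hive array of JSON objects. That array is then passed to explode(), as described in the previous solution.
Note that the separator regex used here ( "}"," ) will not work on the original dataset. There the regex would have to be ( "},\{" ), with "},,,,{" as the replacement, for example: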
split(regexp_replace(substr(json_array_col, 2, length(json_array_col)-2),
'"},\\{"', '"},,,,{"'), ',,,,')
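To sanity-check the delimiter trick before running it in Hive, the three steps can be mirrored in Python. This is only a sketch: `split_json_array` and `raw` are illustrative names, the input is the first record from the question, and Python's `re.sub` stands in for regexp_replace. It assumes, as the pattern itself does, that each object ends with a quoted string value and that the marker `,,,,` never occurs in the data.

```python
import json
import re

# First record from the question, stored as one plain string.
raw = ('[{"ts":1403781896,"id":14,"log":"show"},'
       '{"ts":1403781896,"id":14,"log":"start"}]')


def split_json_array(text, marker=",,,,"):
    """Split a JSON array string into its object strings using a unique marker."""
    # Step 1: drop the leading [ and trailing ], like substr(col, 2, length(col)-2).
    inner = text[1:-1]
    # Step 2: widen the separator between objects from "},{" to "},,,,{".
    marked = re.sub(r'"\},\{"', '"}' + marker + '{"', inner)
    # Step 3: split on the marker, which leaves one JSON object per piece.
    return marked.split(marker)


objects = split_json_array(raw)
for obj in objects:
    record = json.loads(obj)
    print(record["ts"], record["id"], record["log"])
```

Each piece parses as a standalone JSON object, which is what get_json_object needs after explode(). If the last field of an object were a number rather than a string, the pattern would not match and the array would come back as a single piece.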