Posts by Mir*_*rko

Spark SQL HiveContext - saveAsTable creates the wrong schema

I am trying to store a DataFrame in a persistent Hive table with Spark 1.3.0 (PySpark). This is my code:

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="HiveTest")
hc = HiveContext(sc)
peopleRDD = sc.parallelize(['{"name":"Yin","age":30}'])
peopleDF = hc.jsonRDD(peopleRDD)
peopleDF.printSchema()
#root
# |-- age: long (nullable = true)
# |-- name: string (nullable = true)
peopleDF.saveAsTable("peopleHive")

The Hive table I expected as output is:

Column  Data Type   Comments
age     long        from deserializer
name    string      from deserializer

But the actual Hive table produced by the code above is:

Column  Data Type       Comments
col     array<string>   from deserializer

Why does the Hive table schema differ from the DataFrame schema? How can I get the expected output?
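A possible workaround, shown below as a sketch (not from the original question; the temporary table name people_tmp and the CTAS approach are assumptions for illustration), is to register the DataFrame as a temporary table and let HiveQL create the persistent table:

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="HiveTest")
hc = HiveContext(sc)
peopleDF = hc.jsonRDD(sc.parallelize(['{"name":"Yin","age":30}']))

# Register a temporary table so the DataFrame is visible to HiveQL,
# then create the persistent Hive table with an explicit column list (CTAS).
peopleDF.registerTempTable("people_tmp")
hc.sql("CREATE TABLE peopleHive AS SELECT name, age FROM people_tmp")

Because the table is then created by Hive itself, its metastore schema should match the selected columns rather than the serialized layout that saveAsTable writes for data source tables in Spark 1.3.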

hive apache-spark apache-spark-sql

7 votes · 1 answer · 9170 views

How can I query an array of structs with Hive (get_json_object)?

I store the following JSON objects in a Hive table:

{
  "main_id": "qwert",
  "features": [
    {
      "scope": "scope1",
      "name": "foo",
      "value": "ab12345",
      "age": 50,
      "somelist": ["abcde","fghij"]
    },
    {
      "scope": "scope2",
      "name": "bar",
      "value": "cd67890"
    },
    {
      "scope": "scope3",
      "name": "baz",
      "value": [
        "A",
        "B",
        "C"
      ]
    }
  ]
}

"features"是一个长度不一的数组,即所有对象都是可选的.对象具有任意元素,但它们都包含"范围","名称"和"值".

This is the Hive table I created:

CREATE TABLE tbl (
  main_id STRING,
  features array<struct<scope:STRING,name:STRING,value:array<STRING>,age:INT,somelist:array<STRING>>>
)

I need a Hive query that returns main_id and the value of the struct whose name is "baz", i.e.:

main_id baz_value
qwert ["A","B","C"]

My problem is that the Hive UDF get_json_object only supports a limited subset of JSONPath. It does not support a path like get_json_object(features, '$.features[?(@.name='baz')]').

How can I query the desired result with Hive? Would a different Hive table structure make this easier?
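One possible approach, sketched below (not from the original question), is to avoid get_json_object entirely: since the table already declares features as an array of structs, the array can be exploded with a LATERAL VIEW and filtered on the struct's name field.

-- Explode the features array into one row per struct, then keep only
-- the struct whose name is 'baz' and return its value array.
SELECT t.main_id, f.value AS baz_value
FROM tbl t
LATERAL VIEW explode(t.features) fs AS f
WHERE f.name = 'baz';

This relies on the struct schema defined in the CREATE TABLE above, so no JSONPath filtering is needed at query time.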

json hadoop hive hiveql

2 votes · 1 answer · 30k views

Tag statistics

hive ×2

apache-spark ×1

apache-spark-sql ×1

hadoop ×1

hiveql ×1

json ×1