Querying inside an array in Spark SQL

laz*_*wiz 7 apache-spark apache-spark-sql spark-dataframe

For background, I have loaded the JSON like this:

val df = sqlContext.read.json("s3n://...")
df.registerTempTable("posts")

My table in Spark has the following schema:

scala> posts.printSchema
root
 |-- command: string (nullable = true)
 |-- externalId: string (nullable = true)
 |-- sourceMap: struct (nullable = true)
 |    |-- hashtags: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- url: string (nullable = true)
 |-- type: string (nullable = true)

I want to select all posts with the hashtag "nike":

sqlContext.sql("select sourceMap['hashtags'] as ht from posts where ht.contains('nike')");

I get an error: undefined function ht.contains

I'm not sure which method to use to search within the array.

Thanks!

laz*_*wiz 17

I found the answer by referring to Hive SQL:

sqlContext.sql("select sourceMap['hashtags'] from posts where array_contains(sourceMap['hashtags'], 'nike')");

The key function is array_contains().
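
The same filter can probably be written with the DataFrame API as well, since array_contains is also exposed in org.apache.spark.sql.functions. Below is a minimal sketch, not taken from the original setup: it assumes Spark 2.2 or later (SparkSession, and spark.read.json on a Dataset[String]), and the sample rows, the object name HashtagFilterSketch and the helper withHashtag are all hypothetical, made up only to mirror the schema in the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array_contains, col}

// Hypothetical helper object; the names are illustrative only.
object HashtagFilterSketch {

  // Keeps only the rows whose sourceMap.hashtags array contains `tag`.
  // Rows with a null hashtags array are dropped by the filter.
  def withHashtag(posts: DataFrame, tag: String): DataFrame =
    posts.where(array_contains(col("sourceMap.hashtags"), tag))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("hashtag-filter-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows shaped like the schema in the question.
    val posts = spark.read.json(Seq(
      """{"command":"add","externalId":"1","sourceMap":{"hashtags":["nike","run"],"url":"http://example.com/1"},"type":"post"}""",
      """{"command":"add","externalId":"2","sourceMap":{"hashtags":["adidas"],"url":"http://example.com/2"},"type":"post"}"""
    ).toDS())

    // Select the matching posts and expose the array under the alias `ht`.
    withHashtag(posts, "nike")
      .select(col("externalId"), col("sourceMap.hashtags").as("ht"))
      .show(false)

    spark.stop()
  }
}
```

On Spark 1.x, the filter itself should work the same way on the DataFrame returned by sqlContext.read.json, as long as the version is 1.5 or later, which is when array_contains was added to the functions object.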