Seq.contains包含在Spark Dataframe中的SQL中

Question

Seq.contains包含在Spark Dataframe中的SQL中

Mic*_*mlk 4 scala dataframe apache-spark apache-spark-sql

我有以下数据结构:

id: int
records: Seq[String]
other: boolean

在json文件中,为了便于测试:

var data = sc.makeRDD(Seq[String](
   "{\"id\":1, \"records\": [\"one\", \"two\", \"three\"], \"other\": true}", 
   "{\"id\": 2, \"records\": [\"two\"], \"other\": true}", 
   "{\"id\": 3, \"records\": [\"one\"], \"other\": false }"))
sqlContext.jsonRDD(data).registerTempTable("temp")

Run Code Online (Sandbox Code Playgroud)

而且我想过滤到只one在records字段中的记录,other等于只true使用SQL.

我可以通过一个filter(见下文)来做,但它可以只使用SQL吗？

sqlContext
    .sql("select id, records from temp where other = true")
    .rdd.filter(t => t.getAs[Seq[String]]("records").contains("one"))
    .collect()

Run Code Online (Sandbox Code Playgroud)

Answer 1

eli*_*sah 5

Spark SQL支持绝大多数Hive功能,因此您可以使用它array_contains来完成工作:

spark.sql("select id, records from temp where other = true and array_contains(records,'one')").show
# +---+-----------------+
# | id|          records|
# +---+-----------------+
# |  1|[one, two, three]|
# +---+-----------------+

Run Code Online (Sandbox Code Playgroud)

注:在火花1.5,sqlContext.jsonRDD被弃用,改用以下内容:

sqlContext.read.format("json").json(data).registerTempTable("temp")

Run Code Online (Sandbox Code Playgroud)

归档时间：	10 年，2 月前
查看次数：	1677 次
最近记录：	7 年前