Which is faster in Scala: spark.sql or df.filter("").select("")?

scala apache-spark apache-spark-sql

I have a table as a DataFrame, and a temp view created with

table.createOrReplaceTempView("table")

and the query is

spark.sql("SELECT column1 from TABLE where column2 = 'VALUE'")

I want to rewrite the query as

TABLE.filter(TABLE("column2") === "value").select(col("column1"))

So which is faster on a large dataset: plain spark.sql, or filter and select on the DataFrame?

Sun*_*ugu 7

It depends on your use case. Just try both approaches; whichever runs faster is the better fit for you!

I would suggest using

1. spark.time(df.filter("").select(""))

2. spark.time(spark.sql(""))

You can print out both timings and keep the variant with the lower execution time in your code.
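As a minimal sketch of that comparison, assuming a local SparkSession (the session setup and the sample data below are illustrative, not from the question): `spark.time` runs a block, prints "Time taken: ... ms", and returns the block's result. Note that both expressions are lazy, so you need an action such as `count()` to force execution.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object TimingDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical local session; replace with your own cluster config.
    val spark = SparkSession.builder()
      .appName("timing-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Illustrative sample data standing in for the real table.
    val table = List(("foo", "value"), ("bar", "notvalue"))
      .toDF("column1", "column2")
    table.createOrReplaceTempView("table")

    // Time the SQL version; count() forces the query to actually run.
    spark.time(
      spark.sql("SELECT column1 FROM table WHERE column2 = 'value'").count()
    )

    // Time the DataFrame API version of the same query.
    spark.time(
      table.filter(col("column2") === "value").select("column1").count()
    )

    spark.stop()
  }
}
```

On a warm JVM, run each variant several times and compare the printed "Time taken" values; a single measurement can be dominated by JIT and caching effects.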

  • Exactly what I was looking for. Simple and clean! @Jack (2)

Jas*_*r-M 5

I assume that if their physical execution plan is exactly the same, performance will be the same as well. So let's do a test, on Spark 2.2.0:

scala> import spark.implicits._
import spark.implicits._

scala> case class Record(column1: String, column2: String)
defined class Record

scala> val table = List(Record("foo", "value"), Record("bar", "notvalue")).toDF
table: org.apache.spark.sql.DataFrame = [column1: string, column2: string]

scala> table.createOrReplaceTempView("table")

scala> val a = spark.sql("SELECT column1 from TABLE where column2 = 'value'")
a: org.apache.spark.sql.DataFrame = [column1: string]

scala> val b = table.filter(table("column2") === "value").select(col("column1")) 
b: org.apache.spark.sql.DataFrame = [column1: string]

scala> a.explain()
== Physical Plan ==
*Project [column1#41]
+- *Filter (isnotnull(column2#42) && (column2#42 = value))
   +- LocalTableScan [column1#41, column2#42]

scala> b.explain()
== Physical Plan ==
*Project [column1#41]
+- *Filter (isnotnull(column2#42) && (column2#42 = value))
   +- LocalTableScan [column1#41, column2#42]
Run Code Online (Sandbox Code Playgroud)

Looks like there's no difference at all...