Parquet filter pushdown does not work with the Spark Dataset API

Kau*_*hal 5 apache-spark apache-spark-sql apache-spark-dataset catalyst-optimizer

Here is the sample code I am running.

Create a test Parquet dataset partitioned by the mod column.

scala> val test = spark.range(0 , 100000000).withColumn("mod", $"id".mod(40))
test: org.apache.spark.sql.DataFrame = [id: bigint, mod: bigint]

scala> test.write.partitionBy("mod").mode("overwrite").parquet("test_pushdown_filter")


After that, I read the data back as a DataFrame and apply a filter on the partition column, mod.

scala> val df = spark.read.parquet("test_pushdown_filter").filter("mod = 5")
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: bigint, mod: int]

scala> df.queryExecution.executedPlan
res1: org.apache.spark.sql.execution.SparkPlan =
*FileScan parquet [id#16L,mod#17] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/C:/Users/kprajapa/WorkSpace/places/test_pushdown_filter], PartitionCount: 1, PartitionFilters: [isnotnull(mod#17), (mod#17 = 5)], PushedFilters: [], ReadSchema: struct<id:bigint>


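You can see in the execution plan that it reads only 1 partition (PartitionCount: 1), and the predicate appears under PartitionFilters.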

However, if I apply the same filter on a typed Dataset, it reads all the partitions and only then applies the filter.

scala> case class Test(id: Long, mod: Long)
defined class Test

scala> val ds = spark.read.parquet("test_pushdown_filter").as[Test].filter(_.mod==5)
ds: org.apache.spark.sql.Dataset[Test] = [id: bigint, mod: int]

scala> ds.queryExecution.executedPlan
res2: org.apache.spark.sql.execution.SparkPlan =
*Filter <function1>.apply
+- *FileScan parquet [id#22L,mod#23] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/C:/Users/kprajapa/WorkSpace/places/test_pushdown_filter], PartitionCount: 40, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
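
For comparison, the same predicate can also be written as a Column expression on the typed Dataset instead of a Scala lambda. The following is only a minimal sketch, assuming the dataset written above still exists at test_pushdown_filter and that a local Spark session is acceptable; the object name ColumnFilterSketch is hypothetical. A lambda such as `_.mod == 5` is compiled Scala code that Catalyst cannot look inside, whereas `$"mod" === 5` is an expression in the query plan, so the second form would be expected to show the predicate under PartitionFilters just like the DataFrame case.

```scala
import org.apache.spark.sql.SparkSession

// Declared at top level so that Spark can derive an encoder for it.
case class Test(id: Long, mod: Long)

// Hypothetical driver object, used only for this illustration.
object ColumnFilterSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")            // assumption: local run
      .appName("pushdown-sketch")
      .getOrCreate()
    import spark.implicits._

    // Assumes the partitioned dataset from the question was already written here.
    val typed = spark.read.parquet("test_pushdown_filter").as[Test]

    // Typed filter: the lambda is opaque to the optimizer.
    val viaLambda = typed.filter(_.mod == 5)

    // Untyped filter on the same Dataset[Test]: a Column expression
    // that the optimizer can inspect. The result is still a Dataset[Test].
    val viaColumn = typed.filter($"mod" === 5)

    // Print both physical plans to compare PartitionCount and PartitionFilters.
    viaLambda.explain()
    viaColumn.explain()

    spark.stop()
  }
}
```

Comparing the two printed plans side by side should make it clear whether the difference comes from the Dataset type itself or from how the predicate is expressed.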


Is this how the Dataset API is supposed to work, or am I missing something?
