小编Sel*_*amR的帖子

Spark SubQuery 扫描整个分区

我有一个按“日期”字段分区的配置单元表,我想编写一个查询以从最新(最大)分区中获取数据。

spark.sql("select field from table  where date_of = '2019-06-23'").explain(True)
vs 
spark.sql("select filed from table where date_of = (select max(date_of) from table)").explain(True)
Run Code Online (Sandbox Code Playgroud)

下面是两个查询的物理计划

*(1) Project [qbo_company_id#120L]
        +- *(1) FileScan parquet 
    table[qbo_company_id#120L,date_of#157] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[s3location..., PartitionCount: 1, PartitionFilters: [isnotnull(date_of#157), (cast(date_of#157 as string) = 2019-06-23)], PushedFilters: [], ReadSchema: struct<qbo_company_id:bigint>

*(1) Project [qbo_company_id#1L]
+- *(1) Filter (date_of#38 = Subquery subquery0)
   :  +- Subquery subquery0
   :     +- *(2) HashAggregate(keys=[], functions=[max(date_of#76)], output=[max(date_of)#78])
   :        +- Exchange SinglePartition
   :           +- *(1) HashAggregate(keys=[], functions=[partial_max(date_of#76)], output=[max#119])
   : …
Run Code Online (Sandbox Code Playgroud)

apache-spark apache-spark-sql

5
推荐指数
1
解决办法
1005
查看次数

标签 统计

apache-spark ×1

apache-spark-sql ×1