What is the meaning of the Project node in Spark's execution plan?
I have a plan that contains the following:
+- Project [dm_country#population#6a1ad864-235f-4761-9a6d-0ca2a2b40686#834, dm_country#population#country#839, population#17 AS dm_country#population#population#844]
   +- Project [dm_country#population#6a1ad864-235f-4761-9a6d-0ca2a2b40686#834, country#12 AS dm_country#population#country#839, population#17]
      +- Project [6a1ad864-235f-4761-9a6d-0ca2a2b40686#22 AS dm_country#population#6a1ad864-235f-4761-9a6d-0ca2a2b40686#834, country#12, population#17]
         +- RepartitionByExpression [country#12], 1000
            +- Union
               :- Project [ind#7 AS 6a1ad864-235f-4761-9a6d-0ca2a2b40686#22, country#12, population#17]
               :  +- Project [ind#7, country#12, population#2 AS population#17]
               :     +- Project [ind#7, country#1 AS country#12, population#2]
               :        +- Project [ind#0 AS ind#7, country#1, population#2]
               :           +- Relation[ind#0,country#1,population#2] JDBCRelation(schema_dbadmin.t_350) [numPartitions=100]
               +- LogicalRDD [ind#45, country#46, population#47]
Note: since the plan uses a RepartitionByExpression node, it must be a logical query plan.
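As a quick check, here is a minimal sketch (the column name is illustrative): repartition(numPartitions, partitionExprs) is what puts a RepartitionByExpression node into the logical plan, while the physical plan shows an Exchange instead:
import spark.implicits._

val df = spark.range(4).withColumnRenamed("id", "country")
val repartitioned = df.repartition(1000, $"country")

// Logical plan contains RepartitionByExpression [country#...], 1000
println(repartitioned.queryExecution.logical.numberedTreeString)

// Physical plan contains Exchange hashpartitioning(country#..., 1000) instead
println(repartitioned.queryExecution.executedPlan.numberedTreeString)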
A Project node in a logical query plan stands for the Project unary logical operator, and it is created whenever you use some kind of projection, explicitly or implicitly.
More precisely, a Project node can appear in a logical query plan explicitly for the following:
- Dataset operators, i.e. joinWith, select, unionByName (see the sketch below)
- KeyValueGroupedDataset operators, i.e. keys, mapValues
- SQL SELECT queries
A Project node can also appear during the analysis and optimization phases.
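For example, a minimal sketch of one of those cases (the column names are illustrative): unionByName matches columns by name, so it has to reorder one side's columns, which it does with an extra Project node:
import spark.implicits._

val left  = Seq((1, "a")).toDF("id", "name")
val right = Seq(("b", 2)).toDF("name", "id")

// The right-hand side gets a Project that reorders its columns by name
// before the Union node.
println(left.unionByName(right).queryExecution.logical.numberedTreeString)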
In Spark SQL, the Dataset API gives you high-level operators, e.g. select, filter, and groupBy, that ultimately build a Catalyst logical plan of a structured query. In other words, the simple-looking Dataset.select operator just creates a LogicalPlan with a Project node.
val query = spark.range(4).select("id")
scala> println(query.queryExecution.logical)
'Project [unresolvedalias('id, None)]
+- Range (0, 4, step=1, splits=Some(8))
(You could have used query.explain(extended = true) for the above, but that prints all four plans at once, which may obscure the point.)
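If you want the plans one at a time instead, queryExecution exposes each of them; a small sketch:
println(query.queryExecution.logical)       // parsed (unresolved) logical plan
println(query.queryExecution.analyzed)      // analyzed logical plan
println(query.queryExecution.optimizedPlan) // optimized logical plan
println(query.queryExecution.sparkPlan)     // physical plan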
You can look at the code of the Dataset.select operator:
def select(cols: Column*): DataFrame = withPlan {
Project(cols.map(_.named), logicalPlan)
}
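(In the Spark 2.x sources, withPlan is just a small private helper that wraps the given logical plan in a new DataFrame via Dataset.ofRows.)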
This simple-looking select operator is a mere wrapper around Catalyst operators: it builds a Catalyst tree of logical operators, and that tree is the logical plan.
NOTE: The nice thing about Spark SQL's Catalyst is that it uses this recursive LogicalPlan abstraction, where a LogicalPlan is a single logical operator or a whole tree of logical operators.
NOTE: The same applies to good ol' SQL: after being parsed, the SQL text is transformed into an AST of logical operators. See the example below.
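You can even look at that parsed AST directly via the (internal, unstable) parser API; a sketch, assuming a nums table as in the final example:
// parsePlan turns SQL text into an unresolved logical plan (an AST of
// logical operators) without analyzing or executing it.
val ast = spark.sessionState.sqlParser.parsePlan("SELECT id FROM nums")
println(ast.numberedTreeString)
// 00 'Project ['id]
// 01 +- 'UnresolvedRelation `nums`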
Project nodes can come and go: a projection is about the columns in the output, so they may or may not appear in your plans and queries.
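A small sketch of that: stacked select calls create stacked Project nodes in the logical plan, but the optimizer collapses them, and a projection that keeps every column of Range disappears entirely (just as in the SELECT * example at the end):
val q = spark.range(4).select("id").select("id")

// The logical plan has a Project on top of another Project.
println(q.queryExecution.logical.numberedTreeString)

// After optimization the Projects are gone: projecting id over Range is
// a no-op, so only Range remains.
println(q.queryExecution.optimizedPlan.numberedTreeString)
// 00 Range (0, 4, step=1, splits=Some(8))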
You can use Spark SQL's Catalyst DSL (in the org.apache.spark.sql.catalyst.dsl package object) to build Catalyst data structures using Scala implicit conversions. That can be very useful if you are into Spark testing.
scala> spark.version
res0: String = 2.3.0-SNAPSHOT
import org.apache.spark.sql.catalyst.dsl.plans._ // <-- gives table and select
import org.apache.spark.sql.catalyst.dsl.expressions.star
val plan = table("a").select(star())
scala> println(plan.numberedTreeString)
00 'Project [*]
01 +- 'UnresolvedRelation `a`
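Note the leading quotes in the output ('Project, 'UnresolvedRelation): the DSL builds unresolved plans that would still have to go through the analyzer. Finally, here is the promised SQL example, where a Project node shows up after parsing and disappears after optimization: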
scala> spark.range(4).createOrReplaceTempView("nums")
scala> spark.sql("SHOW TABLES").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| | nums| true|
+--------+---------+-----------+
scala> spark.sql("SELECT * FROM nums").explain
== Physical Plan ==
*Range (0, 4, step=1, splits=8)
scala> spark.sql("SELECT * FROM nums").explain(true)
== Parsed Logical Plan ==
'Project [*]
+- 'UnresolvedRelation `nums`
== Analyzed Logical Plan ==
id: bigint
Project [id#40L]
+- SubqueryAlias nums
+- Range (0, 4, step=1, splits=Some(8))
== Optimized Logical Plan ==
Range (0, 4, step=1, splits=Some(8))
== Physical Plan ==
*Range (0, 4, step=1, splits=8)
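Note how the Project node evolves across the plans above: it appears in the parsed plan as 'Project [*], is resolved to Project [id#40L] during analysis, and is removed altogether by the optimizer, since projecting every column of Range is a no-op.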