I am trying to understand the physical plan below, but I have a few doubts:
== Physical Plan ==
*(13) Project [brochure_click_uuid#32, brochure_id#88L, page#36L, duration#188L]
+- *(13) BroadcastHashJoin [brochure_click_uuid#32], [brochure_click_uuid#87], Inner, BuildRight
:- *(13) HashAggregate(keys=[brochure_click_uuid#32, page#36L], functions=[sum(duration#142L)])
: +- Exchange hashpartitioning(brochure_click_uuid#32, page#36L, 200)
: +- *(11) HashAggregate(keys=[brochure_click_uuid#32, page#36L], functions=[partial_sum(duration#142L)])
: +- Union
: :- *(5) Project [brochure_click_uuid#32, page#36L, CASE WHEN (event#34 = EXIT_VIEW) THEN null ELSE (unix_timestamp(_we0#143, yyyy-MM-dd'T'HH:mm:ss, Some(Asia/Calcutta)) - unix_timestamp(date_time#48, yyyy-MM-dd'T'HH:mm:ss, Some(Asia/Calcutta))) END AS duration#142L]
: : +- *(5) Filter ((isnotnull(event#34) && NOT (event#34 = EXIT_VIEW)) && isnotnull(CASE WHEN (event#34 = EXIT_VIEW) THEN null ELSE (unix_timestamp(_we0#143, yyyy-MM-dd'T'HH:mm:ss, Some(Asia/Calcutta)) - unix_timestamp(date_time#48, yyyy-MM-dd'T'HH:mm:ss, Some(Asia/Calcutta))) END))
: : +- Window [lead(date_time#48, 1, null) windowspecdefinition(brochure_click_uuid#32, date_time#48 ASC NULLS FIRST, specifiedwindowframe(RowFrame, 1, 1)) AS _we0#143], [brochure_click_uuid#32], [date_time#48 ASC NULLS FIRST]
: : +- *(4) Sort [brochure_click_uuid#32 ASC NULLS FIRST, date_time#48 ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(brochure_click_uuid#32, 200)
: : +- Union
: : :- *(1) Project [brochure_click_uuid#32, cast(date_time#33 as timestamp) AS date_time#48, page#36L, event#34]
: : : +- *(1) Filter isnotnull(brochure_click_uuid#32)
: : : +- *(1) FileScan json [brochure_click_uuid#32,date_time#33,event#34,page#36L] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/D:/Interview Preparation/Bonial Interview Related/exercise-S/exercise-S/e..., PartitionFilters: [], PushedFilters: [IsNotNull(brochure_click_uuid)], ReadSchema: struct<brochure_click_uuid:string,date_time:string,event:string,page:bigint>
: : :- *(2) Project [brochure_click_uuid#6, cast(date_time#7 as timestamp) AS date_time#20, page#10L, event#8]
: : : +- *(2) Filter isnotnull(brochure_click_uuid#6)
: : : +- *(2) FileScan json [brochure_click_uuid#6,date_time#7,event#8,page#10L] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/D:/Interview Preparation/Bonial Interview Related/exercise-S/exercise-S/p..., PartitionFilters: [], PushedFilters: [IsNotNull(brochure_click_uuid)], ReadSchema: struct<brochure_click_uuid:string,date_time:string,event:string,page:bigint>
: : +- *(3) Project [brochure_click_uuid#60, cast(date_time#61 as timestamp) AS date_time#74, page#64L, event#62]
: : +- *(3) Filter isnotnull(brochure_click_uuid#60)
: : +- *(3) FileScan json [brochure_click_uuid#60,date_time#61,event#62,page#64L] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/D:/Interview Preparation/Bonial Interview Related/exercise-S/exercise-S/e..., PartitionFilters: [], PushedFilters: [IsNotNull(brochure_click_uuid)], ReadSchema: struct<brochure_click_uuid:string,date_time:string,event:string,page:bigint>
: +- *(10) Project [brochure_click_uuid#32, (page#36L + 1) AS page#166L, CASE WHEN (event#34 = EXIT_VIEW) THEN null ELSE (unix_timestamp(_we0#143, yyyy-MM-dd'T'HH:mm:ss, Some(Asia/Calcutta)) - unix_timestamp(date_time#48, yyyy-MM-dd'T'HH:mm:ss, Some(Asia/Calcutta))) END AS duration#142L]
: +- *(10) Filter ((((isnotnull(event#34) && isnotnull(page_view_mode#37)) && NOT (event#34 = EXIT_VIEW)) && (page_view_mode#37 = DOUBLE_PAGE_MODE)) && isnotnull(CASE WHEN (event#34 = EXIT_VIEW) THEN null ELSE (unix_timestamp(_we0#143, yyyy-MM-dd'T'HH:mm:ss, Some(Asia/Calcutta)) - unix_timestamp(date_time#48, yyyy-MM-dd'T'HH:mm:ss, Some(Asia/Calcutta))) END))
: +- Window [lead(date_time#48, 1, null) windowspecdefinition(brochure_click_uuid#32, date_time#48 ASC NULLS FIRST, specifiedwindowframe(RowFrame, 1, 1)) AS _we0#143], [brochure_click_uuid#32], [date_time#48 ASC NULLS FIRST]
: +- *(9) Sort [brochure_click_uuid#32 ASC NULLS FIRST, date_time#48 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(brochure_click_uuid#32, 200)
: +- Union
: :- *(6) Project [brochure_click_uuid#32, cast(date_time#33 as timestamp) AS date_time#48, page#36L, page_view_mode#37, event#34]
: : +- *(6) Filter isnotnull(brochure_click_uuid#32)
: : +- *(6) FileScan json [brochure_click_uuid#32,date_time#33,event#34,page#36L,page_view_mode#37] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/D:/Interview Preparation/Bonial Interview Related/exercise-S/exercise-S/e..., PartitionFilters: [], PushedFilters: [IsNotNull(brochure_click_uuid)], ReadSchema: struct<brochure_click_uuid:string,date_time:string,event:string,page:bigint,page_view_mode:string>
: :- *(7) Project [brochure_click_uuid#6, cast(date_time#7 as timestamp) AS date_time#20, page#10L, page_view_mode#11, event#8]
: : +- *(7) Filter isnotnull(brochure_click_uuid#6)
: : +- *(7) FileScan json [brochure_click_uuid#6,date_time#7,event#8,page#10L,page_view_mode#11] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/D:/Interview Preparation/Bonial Interview Related/exercise-S/exercise-S/p..., PartitionFilters: [], PushedFilters: [IsNotNull(brochure_click_uuid)], ReadSchema: struct<brochure_click_uuid:string,date_time:string,event:string,page:bigint,page_view_mode:string>
: +- *(8) Project [brochure_click_uuid#60, cast(date_time#61 as timestamp) AS date_time#74, page#64L, page_view_mode#65, event#62]
: +- *(8) Filter isnotnull(brochure_click_uuid#60)
: +- *(8) FileScan json [brochure_click_uuid#60,date_time#61,event#62,page#64L,page_view_mode#65] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/D:/Interview Preparation/Bonial Interview Related/exercise-S/exercise-S/e..., PartitionFilters: [], PushedFilters: [IsNotNull(brochure_click_uuid)], ReadSchema: struct<brochure_click_uuid:string,date_time:string,event:string,page:bigint,page_view_mode:string>
+- BroadcastExchange HashedRelationBroadcastMode(List(input[1, string, true]))
+- *(12) Project [brochure_id#88L, brochure_click_uuid#87]
+- *(12) Filter isnotnull(brochure_click_uuid#87)
+- *(12) FileScan json [brochure_click_uuid#87,brochure_id#88L] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/D:/Interview Preparation/Bonial Interview Related/exercise-S/exercise-S/b..., PartitionFilters: [], PushedFilters: [IsNotNull(brochure_click_uuid)], ReadSchema: struct<brochure_click_uuid:string,brochure_id:bigint>
I have the following questions:

1. Which is the head and which is the tail, i.e. where do I start reading and how do I traverse further?
2. What are the numbers at the beginning of some lines, e.g. (13), (11), (5)?
3. Some lines begin with +- and some with :-. What is the difference between the two?
4. What do the cascading lines mean?
5. There are vertical lines built from : characters connecting some lines. What do they mean, and how are the connected steps related?

For example, consider this fragment of the plan:
: +- Union
: :- *(5) Project [brochure_click_uuid#32, page#36L, CASE WHEN (event#34 = EXIT_VIEW) THEN null ELSE (unix_timestamp(_we0#143, yyyy-MM-dd'T'HH:mm:ss, Some(Asia/Calcutta)) - unix_timestamp(date_time#48, yyyy-MM-dd'T'HH:mm:ss, Some(Asia/Calcutta))) END AS duration#142L]
: : +- *(5) Filter ((isnotnull(event#34) && NOT (event#34 = EXIT_VIEW)) && isnotnull(CASE WHEN (event#34 = EXIT_VIEW) THEN null ELSE (unix_timestamp(_we0#143, yyyy-MM-dd'T'HH:mm:ss, Some(Asia/Calcutta)) - unix_timestamp(date_time#48, yyyy-MM-dd'T'HH:mm:ss, Some(Asia/Calcutta))) END))
: : +- Window [lead(date_time#48, 1, null) windowspecdefinition(brochure_click_uuid#32, date_time#48 ASC NULLS FIRST, specifiedwindowframe(RowFrame, 1, 1)) AS _we0#143], [brochure_click_uuid#32], [date_time#48 ASC NULLS FIRST]
: : +- *(4) Sort [brochure_click_uuid#32 ASC NULLS FIRST, date_time#48 ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(brochure_click_uuid#32, 200)
--- Update after the answer ---

For the query plan above, or the smaller query plan you mentioned, please provide as detailed an explanation as possible. I am a newbie, but I am trying hard to learn.
Let me try to answer your questions one by one:
Which is the head and which is the tail, i.e. where do I start reading and how do I traverse further?
A query plan has the structure of a tree, so the right question is what is the root and what are the leaves. The leaf nodes are the most deeply nested ones; in your case they are the FileScan json nodes, and there are several of them. You start reading from the leaves and work your way up to the root, which sits at the top of the plan; in your case it is the first Project operator.
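For intuition, here is a minimal, hypothetical Scala sketch (the file path and column set are assumptions, not your actual data) that produces a plan of the same shape, with FileScan json at the leaf and Project at the root:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("explain-demo")
  .getOrCreate()
import spark.implicits._

// Hypothetical input; any small JSON file with these columns works.
val events = spark.read.json("data/events.json")

val df = events
  .filter($"brochure_click_uuid".isNotNull)   // becomes a Filter node
  .select($"brochure_click_uuid", $"page")    // becomes the Project at the root

// Read the printed plan bottom-up: FileScan json (leaf) -> Filter -> Project (root).
df.explain()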
What are the numbers at the beginning of some lines, e.g. (13), (11), (5)?
It is the codegenStageId. During physical planning, Spark generates Java code for the operators in the plan. Quoting directly from the Spark source code:
The codegenStageCounter generates IDs for the codegen stages within a query plan. The ID is used to help differentiate between codegen stages, and it is included as part of the explain output of the physical plan. The ID makes it obvious that not all adjacent code-generated plan operators belong to the same codegen stage.
Also, the asterisk * means that Spark generated code for that operator via whole-stage code generation.
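To see how the IDs split, note that an Exchange (a shuffle) is a codegen boundary. Continuing the hypothetical sketch above (same spark session and df), a simple aggregation is enough to produce two stages:

// A shuffle splits the plan into separate codegen stages: the partial
// aggregate below the Exchange and the final aggregate above it get
// different IDs, and the Exchange itself carries no asterisk because
// it is not code-generated.
df.groupBy($"brochure_click_uuid").count().explain()

// Expected shape (IDs illustrative):
// *(2) HashAggregate(keys=[brochure_click_uuid#0], functions=[count(1)])
// +- Exchange hashpartitioning(brochure_click_uuid#0, 200)
//    +- *(1) HashAggregate(keys=[brochure_click_uuid#0], functions=[partial_count(1)])
//       +- ... FileScan json ...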
Some lines begin with +- and some begin with :-. What is the difference between when +- is printed before a line and when :- is printed?
Some operators have more than one child, for example Union, BroadcastHashJoin or SortMergeJoin (there are others). In that case, the children of such an operator are displayed in the plan like this:
Union
:- Project ...
: +- here can be child of project
:
+- Project ...
+- here can be child of project
So this plan means that both Project operators are children of the Union operator, and they are therefore at the same level in the tree.
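A quick way to reproduce this: union two hypothetical inputs (same spark session and imports as the sketch above) and inspect the plan; the first child is printed with :- and the last child with +-:

// Hypothetical files; both Projects end up as siblings under Union.
val a = spark.read.json("data/a.json").select($"id")
val b = spark.read.json("data/b.json").select($"id")
a.union(b).explain()

// Expected shape:
// Union
// :- *(1) Project [id#0L]
// :  +- FileScan json ...
// +- *(2) Project [id#3L]
//    +- FileScan json ...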
What do the cascading lines mean?
Cascades like
+- Project
+- Filter
+- Window
simply mean that Filter is a child of Project, that Window is a child of Filter, and so on. It is a tree, and it ends at the leaf nodes, which have no children. In your plan the leaves are the FileScan json nodes.
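The Project / Filter / Window / Sort / Exchange cascade in your plan comes from the lead() window function. Here is a minimal sketch, reusing the hypothetical events from above, that reproduces it:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lead

// lead() introduces a Window operator, which requires its input to be
// sorted within each partition (Sort) and co-located by the partition
// key (Exchange hashpartitioning).
val w = Window.partitionBy($"brochure_click_uuid").orderBy($"date_time")
events
  .withColumn("next_time", lead($"date_time", 1).over(w))
  .filter($"next_time".isNotNull)
  .explain()

// Expected cascade, read bottom-up:
// Project
// +- Filter isnotnull(next_time#...)
//    +- Window [lead(date_time#..., 1, null) ...]
//       +- Sort [brochure_click_uuid#... ASC NULLS FIRST, date_time#... ASC NULLS FIRST], false, 0
//          +- Exchange hashpartitioning(brochure_click_uuid#..., 200)
//             +- FileScan json ...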
There are vertical lines built from : characters connecting some lines. What do these lines mean, and how are the connected steps related to each other?
As explained above, the vertical lines formed with : connect operators that sit at the same level of the tree.