Spark: how to start understanding the following execution plan

Bis*_*Ten 2 apache-spark apache-spark-sql

I am trying to understand the physical plan below, but I have a few questions.

== Physical Plan ==
*(13) Project [brochure_click_uuid#32, brochure_id#88L, page#36L, duration#188L]
+- *(13) BroadcastHashJoin [brochure_click_uuid#32], [brochure_click_uuid#87], Inner, BuildRight
:- *(13) HashAggregate(keys=[brochure_click_uuid#32, page#36L], functions=[sum(duration#142L)])
:  +- Exchange hashpartitioning(brochure_click_uuid#32, page#36L, 200)
:     +- *(11) HashAggregate(keys=[brochure_click_uuid#32, page#36L], functions=[partial_sum(duration#142L)])
:        +- Union
:           :- *(5) Project [brochure_click_uuid#32, page#36L, CASE WHEN (event#34 = EXIT_VIEW) THEN null ELSE (unix_timestamp(_we0#143, yyyy-MM-dd'T'HH:mm:ss, Some(Asia/Calcutta)) - unix_timestamp(date_time#48, yyyy-MM-dd'T'HH:mm:ss, Some(Asia/Calcutta))) END AS duration#142L]
:           :  +- *(5) Filter ((isnotnull(event#34) && NOT (event#34 = EXIT_VIEW)) && isnotnull(CASE WHEN (event#34 = EXIT_VIEW) THEN null ELSE (unix_timestamp(_we0#143, yyyy-MM-dd'T'HH:mm:ss, Some(Asia/Calcutta)) - unix_timestamp(date_time#48, yyyy-MM-dd'T'HH:mm:ss, Some(Asia/Calcutta))) END))
:           :     +- Window [lead(date_time#48, 1, null) windowspecdefinition(brochure_click_uuid#32, date_time#48 ASC NULLS FIRST, specifiedwindowframe(RowFrame, 1, 1)) AS _we0#143], [brochure_click_uuid#32], [date_time#48 ASC NULLS FIRST]
:           :        +- *(4) Sort [brochure_click_uuid#32 ASC NULLS FIRST, date_time#48 ASC NULLS FIRST], false, 0
:           :           +- Exchange hashpartitioning(brochure_click_uuid#32, 200)
:           :              +- Union
:           :                 :- *(1) Project [brochure_click_uuid#32, cast(date_time#33 as timestamp) AS date_time#48, page#36L, event#34]
:           :                 :  +- *(1) Filter isnotnull(brochure_click_uuid#32)
:           :                 :     +- *(1) FileScan json [brochure_click_uuid#32,date_time#33,event#34,page#36L] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/D:/Interview Preparation/Bonial Interview Related/exercise-S/exercise-S/e..., PartitionFilters: [], PushedFilters: [IsNotNull(brochure_click_uuid)], ReadSchema: struct<brochure_click_uuid:string,date_time:string,event:string,page:bigint>
:           :                 :- *(2) Project [brochure_click_uuid#6, cast(date_time#7 as timestamp) AS date_time#20, page#10L, event#8]
:           :                 :  +- *(2) Filter isnotnull(brochure_click_uuid#6)
:           :                 :     +- *(2) FileScan json [brochure_click_uuid#6,date_time#7,event#8,page#10L] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/D:/Interview Preparation/Bonial Interview Related/exercise-S/exercise-S/p..., PartitionFilters: [], PushedFilters: [IsNotNull(brochure_click_uuid)], ReadSchema: struct<brochure_click_uuid:string,date_time:string,event:string,page:bigint>
:           :                 +- *(3) Project [brochure_click_uuid#60, cast(date_time#61 as timestamp) AS date_time#74, page#64L, event#62]
:           :                    +- *(3) Filter isnotnull(brochure_click_uuid#60)
:           :                       +- *(3) FileScan json [brochure_click_uuid#60,date_time#61,event#62,page#64L] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/D:/Interview Preparation/Bonial Interview Related/exercise-S/exercise-S/e..., PartitionFilters: [], PushedFilters: [IsNotNull(brochure_click_uuid)], ReadSchema: struct<brochure_click_uuid:string,date_time:string,event:string,page:bigint>
:           +- *(10) Project [brochure_click_uuid#32, (page#36L + 1) AS page#166L, CASE WHEN (event#34 = EXIT_VIEW) THEN null ELSE (unix_timestamp(_we0#143, yyyy-MM-dd'T'HH:mm:ss, Some(Asia/Calcutta)) - unix_timestamp(date_time#48, yyyy-MM-dd'T'HH:mm:ss, Some(Asia/Calcutta))) END AS duration#142L]
:              +- *(10) Filter ((((isnotnull(event#34) && isnotnull(page_view_mode#37)) && NOT (event#34 = EXIT_VIEW)) && (page_view_mode#37 = DOUBLE_PAGE_MODE)) && isnotnull(CASE WHEN (event#34 = EXIT_VIEW) THEN null ELSE (unix_timestamp(_we0#143, yyyy-MM-dd'T'HH:mm:ss, Some(Asia/Calcutta)) - unix_timestamp(date_time#48, yyyy-MM-dd'T'HH:mm:ss, Some(Asia/Calcutta))) END))
:                 +- Window [lead(date_time#48, 1, null) windowspecdefinition(brochure_click_uuid#32, date_time#48 ASC NULLS FIRST, specifiedwindowframe(RowFrame, 1, 1)) AS _we0#143], [brochure_click_uuid#32], [date_time#48 ASC NULLS FIRST]
:                    +- *(9) Sort [brochure_click_uuid#32 ASC NULLS FIRST, date_time#48 ASC NULLS FIRST], false, 0
:                       +- Exchange hashpartitioning(brochure_click_uuid#32, 200)
:                          +- Union
:                             :- *(6) Project [brochure_click_uuid#32, cast(date_time#33 as timestamp) AS date_time#48, page#36L, page_view_mode#37, event#34]
:                             :  +- *(6) Filter isnotnull(brochure_click_uuid#32)
:                             :     +- *(6) FileScan json [brochure_click_uuid#32,date_time#33,event#34,page#36L,page_view_mode#37] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/D:/Interview Preparation/Bonial Interview Related/exercise-S/exercise-S/e..., PartitionFilters: [], PushedFilters: [IsNotNull(brochure_click_uuid)], ReadSchema: struct<brochure_click_uuid:string,date_time:string,event:string,page:bigint,page_view_mode:string>
:                             :- *(7) Project [brochure_click_uuid#6, cast(date_time#7 as timestamp) AS date_time#20, page#10L, page_view_mode#11, event#8]
:                             :  +- *(7) Filter isnotnull(brochure_click_uuid#6)
:                             :     +- *(7) FileScan json [brochure_click_uuid#6,date_time#7,event#8,page#10L,page_view_mode#11] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/D:/Interview Preparation/Bonial Interview Related/exercise-S/exercise-S/p..., PartitionFilters: [], PushedFilters: [IsNotNull(brochure_click_uuid)], ReadSchema: struct<brochure_click_uuid:string,date_time:string,event:string,page:bigint,page_view_mode:string>
:                             +- *(8) Project [brochure_click_uuid#60, cast(date_time#61 as timestamp) AS date_time#74, page#64L, page_view_mode#65, event#62]
:                                +- *(8) Filter isnotnull(brochure_click_uuid#60)
:                                   +- *(8) FileScan json [brochure_click_uuid#60,date_time#61,event#62,page#64L,page_view_mode#65] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/D:/Interview Preparation/Bonial Interview Related/exercise-S/exercise-S/e..., PartitionFilters: [], PushedFilters: [IsNotNull(brochure_click_uuid)], ReadSchema: struct<brochure_click_uuid:string,date_time:string,event:string,page:bigint,page_view_mode:string>
+- BroadcastExchange HashedRelationBroadcastMode(List(input[1, string, true]))
   +- *(12) Project [brochure_id#88L, brochure_click_uuid#87]
      +- *(12) Filter isnotnull(brochure_click_uuid#87)
         +- *(12) FileScan json [brochure_click_uuid#87,brochure_id#88L] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/D:/Interview Preparation/Bonial Interview Related/exercise-S/exercise-S/b..., PartitionFilters: [], PushedFilters: [IsNotNull(brochure_click_uuid)], ReadSchema: struct<brochure_click_uuid:string,brochure_id:bigint>

I have the following questions:

  1. Which end is the head and which is the tail, i.e. where do I start and how do I traverse further?
  2. What are the numbers at the start of each line, e.g. (13), (11), (5)?
  3. Some lines start with +- and others with :-. What is the difference between a line preceded by +- and one preceded by :-?
  4. What is the Project list?
  5. What do the cascading lines mean, e.g. the following?

:        +- Union
:           :- *(5) Project [brochure_click_uuid#32, page#36L, CASE WHEN (event#34 = EXIT_VIEW) THEN null ELSE (unix_timestamp(_we0#143, yyyy-MM-dd'T'HH:mm:ss, Some(Asia/Calcutta)) - unix_timestamp(date_time#48, yyyy-MM-dd'T'HH:mm:ss, Some(Asia/Calcutta))) END AS duration#142L]
:           :  +- *(5) Filter ((isnotnull(event#34) && NOT (event#34 = EXIT_VIEW)) && isnotnull(CASE WHEN (event#34 = EXIT_VIEW) THEN null ELSE (unix_timestamp(_we0#143, yyyy-MM-dd'T'HH:mm:ss, Some(Asia/Calcutta)) - unix_timestamp(date_time#48, yyyy-MM-dd'T'HH:mm:ss, Some(Asia/Calcutta))) END))
:           :     +- Window [lead(date_time#48, 1, null) windowspecdefinition(brochure_click_uuid#32, date_time#48 ASC NULLS FIRST, specifiedwindowframe(RowFrame, 1, 1)) AS _we0#143], [brochure_click_uuid#32], [date_time#48 ASC NULLS FIRST]
:           :        +- *(4) Sort [brochure_click_uuid#32 ASC NULLS FIRST, date_time#48 ASC NULLS FIRST], false, 0
:           :           +- Exchange hashpartitioning(brochure_click_uuid#32, 200)
Also, what are the vertical lines formed with : between two lines? What do they mean, and how are the two steps related to each other?

--- Update after the answer ---

So, in the query plan above, or in the smaller query plan you mentioned:

  1. How do I work out the number of jobs (if possible), the job stages, and the steps that make up each job stage?
  2. Does a parent node together with all of its children form one job stage? And if an operator, as you mentioned, has multiple children at the same level, does that mean there are multiple stages leading up to the parent?
  3. Finally, you mentioned at the start of your answer that there are a large number of file scans. Does this happen because the RDD/DataFrame is recomputed?

Please explain in as much detail as possible. I am a novice, but I am trying hard to learn.

Dav*_*rba 6

Let me try to answer your questions one by one:

Which end is the head and which is the tail, i.e. where do I start and how do I traverse further?

The query plan has the structure of a tree, so the right question is what the root is and what the leaves are. The leaf nodes are the most deeply nested ones; in your case they are the FileScan json nodes, and there are several of them. You start reading from those, and you work your way toward the root, which sits at the top of the plan; in your case it is the first Project operator.
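As a hedged illustration of this reading order (plain Python, no Spark dependency; the tiny plan string and the `operators` helper are invented for this sketch, not part of any Spark API), you can strip the tree markers and the codegen-stage IDs, then reverse the printed order to get the leaf-to-root order in which the plan is best read:

```python
import re

# A tiny, hypothetical plan fragment in the same notation Spark prints.
plan = """\
*(2) Project [a#1]
+- *(2) Filter isnotnull(a#1)
   +- *(1) FileScan json [a#1]"""

def operators(plan_text):
    """Return the operator name on each line, top-down as printed."""
    ops = []
    for line in plan_text.splitlines():
        cleaned = re.sub(r"^[\s:+|-]*", "", line)        # drop the tree-drawing prefix
        cleaned = re.sub(r"^\*\(\d+\)\s*", "", cleaned)  # drop "*(<codegenStageId>)"
        ops.append(re.match(r"\w+", cleaned).group(0))   # keep only the operator name
    return ops

top_down = operators(plan)            # root first, as printed
bottom_up = list(reversed(top_down))  # reading order: leaves first
print(bottom_up)  # ['FileScan', 'Filter', 'Project']
```

The printed plan puts the root on top, so reversing a single root-to-leaf chain gives the order in which data actually flows.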

What are the numbers at the start of each line, e.g. (13), (11), (5)?

It is the codegenStageId. During physical planning, Spark generates Java code for the operators in the plan. Quoting directly from the Spark source code:

codegenStageCounter generates IDs for the codegen stages in a query plan. The ID is used to help differentiate between codegen stages. It is included as part of the explain output of the physical plan. The ID makes it obvious that not all adjacent codegen'd plan operators belong to the same codegen stage.

Also, the asterisk * means that Spark generated code for that operator.

Some lines start with +- and others with :-. What is the difference between a line preceded by +- and one preceded by :-?

Some operators have more than one child, for example Union, BroadcastHashJoin, or SortMergeJoin (there are others as well). In that case the children of such an operator are displayed in the plan like this:

Union
:- Project ...
:  +- here can be child of project
:
+- Project ...
   +- here can be child of project 

So this plan means that both Project operators are children of the Union operator, and they are therefore at the same level in the tree.
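A hedged sketch of how those markers encode depth (stdlib Python only; the `depth` helper is hypothetical and assumes Spark's 3-characters-per-level indentation, which matches the plan above): the two children of Union come out at the same depth, confirming they are siblings.

```python
def depth(line):
    """Tree depth of a plan line, derived from the width of its drawing prefix.

    Assumption: Spark indents each level by 3 characters (":  ", ":- ",
    "+- ", or spaces), so the prefix width divides evenly by 3.
    """
    i = 0
    while i < len(line) and line[i] in " :+-|":
        i += 1
    return i // 3

plan = [
    "Union",
    ":- Project ...",
    ":  +- Filter ...",
    "+- Project ...",
    "   +- Filter ...",
]
print([depth(l) for l in plan])  # [0, 1, 2, 1, 2]
```

Both Project lines sit at depth 1 under Union; :- marks a child that has further siblings below it, while +- marks the last one.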

What do the cascading lines mean?

These cascades

+- Project
   +- Filter
      +- Window

simply mean that Filter is a child of Project, Window is a child of Filter, and so on. It is a tree, and it ends at the leaf nodes, which have no children. In your plan the leaves are the FileScan json nodes.
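This parent/child chain can also be recovered mechanically: each operator's parent is the nearest line above it with a smaller depth. A hedged stdlib-Python sketch (the `depth` and `edges` helpers are hypothetical, and it again assumes the 3-characters-per-level indentation seen in the plan):

```python
def depth(line):
    """Tree depth from the width of the drawing prefix (3 chars per level)."""
    i = 0
    while i < len(line) and line[i] in " :+-|":
        i += 1
    return i // 3

def edges(plan_lines):
    """Pair each operator with its parent: the closest shallower line above it."""
    result = []
    stack = []  # (depth, name) ancestors of the current line
    for line in plan_lines:
        d, name = depth(line), line.strip(" :+-|").split()[0]
        while stack and stack[-1][0] >= d:
            stack.pop()
        if stack:
            result.append((stack[-1][1], name))  # (parent, child)
        stack.append((d, name))
    return result

plan = ["Project", "+- Filter", "   +- Window", "      +- Sort"]
print(edges(plan))
# [('Project', 'Filter'), ('Filter', 'Window'), ('Window', 'Sort')]
```

Running it on the cascade above yields exactly the parent-to-child edges the answer describes.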

What are the vertical lines formed with : between two lines? What do they mean? How are the two steps related to each other?

As I explained above, the vertical lines formed with : connect operators that are at the same level in the tree.

  • @BishamonTen You're welcome. Yes, there are 13 codegen stages in your plan. No, there are not 13 shuffles. The number of shuffles is related to the number of job stages, and job stages are different from codegen stages. (2 upvotes)