Spark SQL: Why are there two jobs for one query?

Moh*_*itt 9 unsafe apache-spark parquet apache-spark-sql

Experiment

I tried the snippet below on Spark 1.6.1.

val soDF = sqlContext.read.parquet("/batchPoC/saleOrder") // This has 45 files
soDF.registerTempTable("so")
sqlContext.sql("select dpHour, count(*) as cnt from so group by dpHour order by cnt").write.parquet("/out/")

The physical plan is:

== Physical Plan ==
Sort [cnt#59L ASC], true, 0
+- ConvertToUnsafe
   +- Exchange rangepartitioning(cnt#59L ASC,200), None
      +- ConvertToSafe
         +- TungstenAggregate(key=[dpHour#38], functions=[(count(1),mode=Final,isDistinct=false)], output=[dpHour#38,cnt#59L])
            +- TungstenExchange hashpartitioning(dpHour#38,200), None
               +- TungstenAggregate(key=[dpHour#38], functions=[(count(1),mode=Partial,isDistinct=false)], output=[dpHour#38,count#63L])
                  +- Scan ParquetRelation[dpHour#38] InputPaths: hdfs://hdfsNode:8020/batchPoC/saleOrder
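For readers who want to reproduce a plan like this one, a minimal sketch follows. It assumes the same spark-shell session as the question (so `sqlContext` is in scope and the temp table `so` is already registered); the variable name `query` is introduced here only for illustration.

```scala
// Assumption: run inside the spark-shell session from the question,
// after soDF.registerTempTable("so") has been executed.
val query = sqlContext.sql("select dpHour, count(*) as cnt from so group by dpHour order by cnt")

// Prints the physical plan to the console.
query.explain()

// Passing true also prints the parsed, analyzed and optimized logical plans.
query.explain(true)
```

The plan shows operators and exchanges, not jobs; how the scheduler splits the work into jobs is only visible in the UI or the driver logs.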

For this query, I got two jobs, Job 9 and Job 10: [image]

For Job 9, the DAG is:

[image]

For Job 10, the DAG is:

[image]

Observations

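  1. Apparently, there are two jobs for one query.
  2. Stage-16 (shown as Stage-14 in Job 9) is skipped in Job 10.
  3. The last RDD of Stage-15, RDD[48], is the same as the last RDD of Stage-17, RDD[49]. How? I saw in the logs that after Stage-15 executed, RDD[48] was registered as RDD[49].
  4. Stage-17 appears in the driver logs but was never executed on the executors. The driver logs show task execution, but when I looked at the YARN container logs, there was no evidence of any task from Stage-17 being received.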

Logs supporting these observations (driver logs only; I lost the executor logs in a later crash). They show that RDD[49] is registered before Stage-17 starts:

16/06/10 22:11:22 INFO TaskSetManager: Finished task 196.0 in stage 15.0 (TID 1121) in 21 ms on slave-1 (199/200)
16/06/10 22:11:22 INFO TaskSetManager: Finished task 198.0 in stage 15.0 (TID 1123) in 20 ms on slave-1 (200/200)
16/06/10 22:11:22 INFO YarnScheduler: Removed TaskSet 15.0, whose tasks have all completed, from pool 
16/06/10 22:11:22 INFO DAGScheduler: ResultStage 15 (parquet at <console>:26) finished in 0.505 s
16/06/10 22:11:22 INFO DAGScheduler: Job 9 finished: parquet at <console>:26, took 5.054011 s
16/06/10 22:11:22 INFO ParquetRelation: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
16/06/10 22:11:22 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
16/06/10 22:11:22 INFO DefaultWriterContainer: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
16/06/10 22:11:22 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
16/06/10 22:11:22 INFO SparkContext: Starting job: parquet at <console>:26
16/06/10 22:11:22 INFO DAGScheduler: Registering RDD 49 (parquet at <console>:26)
16/06/10 22:11:22 INFO DAGScheduler: Got job 10 (parquet at <console>:26) with 25 output partitions
16/06/10 22:11:22 INFO DAGScheduler: Final stage: ResultStage 18 (parquet at <console>:26)
16/06/10 22:11:22 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 17)
16/06/10 22:11:22 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 17)
16/06/10 22:11:22 INFO DAGScheduler: Submitting ShuffleMapStage 17 (MapPartitionsRDD[49] at parquet at <console>:26), which has no missing parents
16/06/10 22:11:22 INFO MemoryStore: Block broadcast_25 stored as values in memory (estimated size 17.4 KB, free 512.3 KB)
16/06/10 22:11:22 INFO MemoryStore: Block broadcast_25_piece0 stored as bytes in memory (estimated size 8.9 KB, free 521.2 KB)
16/06/10 22:11:22 INFO BlockManagerInfo: Added broadcast_25_piece0 in memory on 172.16.20.57:44944 (size: 8.9 KB, free: 517.3 MB)
16/06/10 22:11:22 INFO SparkContext: Created broadcast 25 from broadcast at DAGScheduler.scala:1006
16/06/10 22:11:22 INFO DAGScheduler: Submitting 200 missing tasks from ShuffleMapStage 17 (MapPartitionsRDD[49] at parquet at <console>:26)
16/06/10 22:11:22 INFO YarnScheduler: Adding task set 17.0 with 200 tasks
16/06/10 22:11:23 INFO TaskSetManager: Starting task 0.0 in stage 17.0 (TID 1125, slave-1, partition 0,NODE_LOCAL, 1988 bytes)
16/06/10 22:11:23 INFO TaskSetManager: Starting task 1.0 in stage 17.0 (TID 1126, slave-2, partition 1,NODE_LOCAL, 1988 bytes)
16/06/10 22:11:23 INFO TaskSetManager: Starting task 2.0 in stage 17.0 (TID 1127, slave-1, partition 2,NODE_LOCAL, 1988 bytes)
16/06/10 22:11:23 INFO TaskSetManager: Starting task 3.0 in stage 17.0 (TID 1128, slave-2, partition 3,NODE_LOCAL, 1988 bytes)
16/06/10 22:11:23 INFO TaskSetManager: Starting task 4.0 in stage 17.0 (TID 1129, slave-1, partition 4,NODE_LOCAL, 1988 bytes)
16/06/10 22:11:23 INFO TaskSetManager: Starting task 5.0 in stage 17.0 (TID 1130, slave-2, partition 5,NODE_LOCAL, 1988 bytes)

Questions

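  1. Why two jobs? What is the intent behind splitting one DAG into two jobs?
  2. The DAG of Job 10 looks complete for executing the query. Is there anything specific that Job 9 is doing?
  3. Why is Stage-17 not skipped? It looks as if dummy tasks are created; do they serve any purpose?
  4. Later, I tried another, much simpler query. Unexpectedly, it created 3 jobs.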

    sqlContext.sql("select dpHour from so order by dphour").write.parquet("/out2/")

Sim*_*Sim 7

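When you use the high-level DataFrame/Dataset APIs, you leave it to Spark to determine the execution plan, including how the work is chunked into jobs and stages. This depends on many factors, such as execution parallelism and cached/persisted data structures. In future versions of Spark, as the optimizer becomes more sophisticated, you may see even more jobs per query, for example when some data sources are sampled to parameterize cost-based execution optimization.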

For example, I have often (but not always) seen writes generate jobs separate from the processing that involves shuffles.

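Bottom line: if you are using the high-level APIs, it rarely pays to dig into the specific chunking unless you have to do extremely detailed optimization on huge data volumes. Job startup costs are extremely low compared to processing and output.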

If, on the other hand, you are curious about Spark internals, read the optimizer code and engage on the Spark developer mailing list.
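If you do want to watch the chunking without reading the UI, one option is to register a `SparkListener` and print each job as it starts. This is a sketch rather than a recommended setup: it assumes a spark-shell session with `sc` and `sqlContext` in scope and the temp table `so` from the question registered, and the output path `/out3/` is a hypothetical name chosen only for this illustration.

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

// Assumption: `sc` and `sqlContext` come from the spark-shell session,
// and the temp table "so" is already registered.
// Listener events are delivered asynchronously, so these lines may be
// printed slightly after the action itself has returned.
sc.addSparkListener(new SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    println(s"Job ${jobStart.jobId} started, stages: ${jobStart.stageIds.mkString(", ")}")

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"Job ${jobEnd.jobId} ended: ${jobEnd.jobResult}")
})

// "/out3/" is a hypothetical output path used only for this example.
val query = sqlContext.sql("select dpHour, count(*) as cnt from so group by dpHour order by cnt")
query.write.parquet("/out3/")
```

Each line printed by `onJobStart` lists the stage ids that belong to that job, which makes it easy to compare against the stage numbers in the driver logs.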

  • This is odd: why can't the stages of the second job be part of the first job? (2 upvotes)