val postsQuantiles = posts.stat.approxQuantile("_score", Array(0.25, 0.75), 0.0)
fails with the error below. Obviously I can raise spark.driver.maxResultSize to get past it, but I'm curious why this collects data to the driver at all.
[Stage 3:==================> (7 + 15) / 22]19/06/01 20:46:30 ERROR TaskSetManager: Total size of serialized results of 18 tasks (1030.8 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
19/06/01 20:46:30 ERROR TaskSetManager: Total size of serialized results of 19 tasks (1087.7 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
19/06/01 20:46:30 ERROR TaskSetManager: Total size of serialized results of 20 tasks (1145.6 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
19/06/01 20:46:30 ERROR TaskSetManager: Total size of serialized results of 21 tasks (1203.5 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
19/06/01 20:46:30 ERROR TaskSetManager: Total size of serialized results of 22 tasks (1261.4 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
[Stage 3:====================================> (14 + 8) / 22]org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 18 tasks (1030.8 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2131)
at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1092)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.fold(RDD.scala:1086)
at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1131)
at org.apache.spark.sql.execution.stat.StatFunctions$.multipleApproxQuantiles(StatFunctions.scala:102)
at org.apache.spark.sql.DataFrameStatFunctions.approxQuantile(DataFrameStatFunctions.scala:100)
at org.apache.spark.sql.DataFrameStatFunctions.approxQuantile(DataFrameStatFunctions.scala:75)
... 56 elided
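For reference, the workaround mentioned above is just a session-level setting; a minimal sketch, with "4g" as an illustrative value rather than a recommendation:

import org.apache.spark.sql.SparkSession

// Raise the cap on the total size of serialized task results the driver
// will accept. "4g" is illustrative; pick a value the driver's memory can hold.
val spark = SparkSession.builder()
  .appName("quantiles")
  .config("spark.driver.maxResultSize", "4g")
  .getOrCreate()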
The approxQuantile method implements a variant of the Greenwald-Khanna algorithm for computing approximate quantiles (per the documentation: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions). It lets you specify a relative error term.
The documentation warns that choosing a relative error of 0.0 (as you did) can be very expensive, and that is exactly what you are seeing. The algorithm is suited to approximate quantiles, not exact ones: computing an exact quantile ultimately requires pulling all of the column's data to the driver, which is why so much data is being shipped back.
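Conceptually (this is a sketch of the cost, not Spark's actual code path), an exact quantile with relativeError = 0.0 degenerates toward something like:

import org.apache.spark.sql.functions.col

// Every value of the column must reach the driver before it can be ranked.
// Assumes "_score" has no nulls; for illustration only.
val scores = posts.select(col("_score").cast("double"))
  .collect()                 // ships the entire column to the driver
  .map(_.getDouble(0))
  .sorted
val exactQ1 = scores(((scores.length - 1) * 0.25).toInt) // exact 25th percentile by rank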
You can read more about the algorithm in the published paper: http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf
To get around this, I'd suggest using a suitably small non-zero relative error term, one that gives you results you're confident are "close enough".
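For example, with an assumed tolerance of 0.01, each returned value is guaranteed to lie within 1% of the exact rank, and the per-partition sketches stay compact enough to merge on the driver:

// Same call as before, but with a small non-zero relative error term.
val postsQuantiles = posts.stat.approxQuantile("_score", Array(0.25, 0.75), 0.01)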