Spark v3.0.0 - WARN DAGScheduler: broadcasting large task binary with size xx

Question

vit*_*a96 11 java apache-spark apache-spark-ml apache-spark-mllib

I am new to Spark. I am writing a machine learning algorithm in standalone Spark (v3.0.0) with the following configuration:

SparkConf conf = new SparkConf();
conf.setMaster("local[*]");
conf.set("spark.driver.memory", "8g");
conf.set("spark.driver.maxResultSize", "8g");
conf.set("spark.memory.fraction", "0.6");
conf.set("spark.memory.storageFraction", "0.5");
conf.set("spark.sql.shuffle.partitions", "5");
conf.set("spark.memory.offHeap.enabled", "false");
conf.set("spark.reducer.maxSizeInFlight", "96m");
conf.set("spark.shuffle.file.buffer", "256k");
conf.set("spark.sql.debug.maxToStringFields", "100");

This is how I create the CrossValidator:

ParamMap[] paramGrid = new ParamGridBuilder()
        .addGrid(gbt.maxBins(), new int[]{50})
        .addGrid(gbt.maxDepth(), new int[]{2, 5, 10})
        .addGrid(gbt.maxIter(), new int[]{5, 20, 40})
        .addGrid(gbt.minInfoGain(), new double[]{0.0d, .1d, .5d})
        .build();

CrossValidator gbcv = new CrossValidator()
        .setEstimator(gbt)
        .setEstimatorParamMaps(paramGrid)
        .setEvaluator(gbevaluator)
        .setNumFolds(5)
        .setParallelism(8)
        .setSeed(session.getArguments().getTrainingRandom());

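The problem is that when (in paramGrid) maxDepth is just {2, 5} and maxIter is {5, 20}, everything works fine. With the grid shown in the code above, however, it keeps logging WARN DAGScheduler: broadcasting large task binary with size xxxx, where xxxx ranges from 1000 KiB to 2.9 MiB, and this often ends in a timeout exception. Which Spark parameters should I change to avoid this?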
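To put numbers on the difference between the two grids, here is a rough sketch in plain Java (no Spark needed). It assumes maxBins and minInfoGain stay the same in the smaller grid, that maxIter equals the number of trees in the boosted model, and that each tree is a binary tree bounded by maxDepth. The class and method names are made up for illustration. Whether the larger models are what actually pushes the task binary past the warning threshold is a guess, not something the question confirms.

```java
// Sketch: sizes implied by the parameter grids in the question (plain Java, no Spark).
// GridFitCount and its methods are hypothetical helper names, not part of Spark.
final class GridFitCount {

    // A parameter grid is a Cartesian product, so its size is the product
    // of the number of values supplied for each parameter.
    static long countParamMaps(int... valuesPerParam) {
        long total = 1;
        for (int n : valuesPerParam) {
            total *= n;
        }
        return total;
    }

    // CrossValidator fits every ParamMap once per fold, then refits the
    // best ParamMap once on the full dataset.
    static long countFits(long paramMaps, int numFolds) {
        return paramMaps * numFolds + 1;
    }

    // Upper bound on the nodes of one binary tree (depth 0 is a single leaf).
    static long maxNodesPerTree(int maxDepth) {
        return (1L << (maxDepth + 1)) - 1;
    }

    public static void main(String[] args) {
        // Values per parameter: maxBins, maxDepth, maxIter, minInfoGain.
        long small = countParamMaps(1, 2, 2, 3);
        long large = countParamMaps(1, 3, 3, 3);

        System.out.println("small grid: " + small + " ParamMaps, " + countFits(small, 5) + " fits");
        System.out.println("large grid: " + large + " ParamMaps, " + countFits(large, 5) + " fits");

        // Largest model in each grid: deepest tree times the most iterations.
        System.out.println("max nodes, depth 5 x 20 trees: " + maxNodesPerTree(5) * 20);
        System.out.println("max nodes, depth 10 x 40 trees: " + maxNodesPerTree(10) * 40);
    }
}
```

Under those assumptions the larger grid means 27 parameter combinations and 136 fits instead of 12 and 61, and the biggest candidate model can hold up to 81880 tree nodes instead of 1260.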

Answer 1

小智 3

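For the timeout issue, consider changing the following configuration:

Set spark.sql.autoBroadcastJoinThreshold to -1.

Setting it to -1 disables automatic broadcast joins, so the default 10 MB broadcast threshold no longer applies.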
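A minimal sketch of how that setting could be applied through SparkConf, in the same style as the configuration in the question. It assumes Spark 3.0.0 is on the classpath. The class and helper method names are hypothetical, and whether this setting cures the timeout in this particular job is not verified here.

```java
import org.apache.spark.SparkConf;

// Sketch only: BroadcastJoinConfig and withAutoBroadcastJoinDisabled are
// hypothetical names used for illustration.
final class BroadcastJoinConfig {

    // Returns the same SparkConf with automatic broadcast joins switched off.
    // A value of -1 disables the size-based broadcast decision entirely.
    static SparkConf withAutoBroadcastJoinDisabled(SparkConf conf) {
        return conf.set("spark.sql.autoBroadcastJoinThreshold", "-1");
    }

    public static void main(String[] args) {
        SparkConf conf = withAutoBroadcastJoinDisabled(
                new SparkConf().setMaster("local[*]"));

        // Read the value back to confirm it was recorded in the configuration.
        System.out.println(conf.get("spark.sql.autoBroadcastJoinThreshold"));
    }
}
```

The configuration has to be in place before the SparkSession is created, since it is read when the session is built.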
