Broadcasting a large lookup table results in a KryoSerializer error

I have a large RDD whose contents come to roughly 10GB. I want to turn it into a lookup table for use in Spark, using:

    val lookupTable = sparkContext.broadcast(entitiesRDD.collect)

but it fails with:

    17/02/27 17:33:25 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, d1): org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 2. To avoid this, increase spark.kryoserializer.buffer.max value.
        at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:299)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:240)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
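For reference, the buffer named in the message is an ordinary Spark configuration key, settable on the command line via --conf or when building the context; a minimal sketch with illustrative values (the app name and "1g" below are not from the original post):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: enable Kryo and raise its maximum serialization buffer.
    // "1g" is an illustrative value; as shown next, this setting has a hard
    // ceiling just under 2048mb.
    val conf = new SparkConf()
      .setAppName("lookup-table")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryoserializer.buffer.max", "1g")
    val sparkContext = new SparkContext(conf)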

I cannot increase spark.kryoserializer.buffer.max beyond 2048mb, or I get the error:

    Caused by: java.lang.IllegalArgumentException: spark.kryoserializer.buffer.max must be less than 2048 mb, got: + 2048 mb.
        at org.apache.spark.serializer.KryoSerializer.<init>(KryoSerializer.scala:66)

How do other people build large lookup tables in Spark?
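For what it's worth, the 2048mb ceiling is a hard limit: Kryo serializes into a single byte array, and JVM arrays are indexed by Int, so no configuration change will push a ~10GB object graph through one buffer. One common workaround (a sketch, not from the original post; Entity, Event, and eventsRDD are hypothetical stand-ins for the poster's data) is to skip collect entirely and express the lookup as a distributed join:

    import org.apache.spark.rdd.RDD

    // Hypothetical types standing in for the poster's data.
    case class Entity(id: Long, payload: String)
    case class Event(entityId: Long, value: Double)

    // Key both sides by the lookup id and let Spark perform a distributed
    // join, so nothing close to 10GB ever has to fit in a single buffer.
    def lookupViaJoin(entitiesRDD: RDD[Entity],
                      eventsRDD: RDD[Event]): RDD[(Long, (Event, Entity))] = {
      val entitiesById = entitiesRDD.map(e => (e.id, e))
      val eventsById = eventsRDD.map(ev => (ev.entityId, ev))
      eventsById.join(entitiesById)
    }

If broadcast semantics are genuinely needed, another common pattern is to project or filter the table down to only the fields used for lookup, so that the collected data stays well under the 2GB serialization cap before calling sparkContext.broadcast.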