garbage-collection scala g1gc apache-spark apache-spark-sql
I'm running Spark 2 and trying to shuffle roughly 5 TB of JSON. I'm hitting very long garbage-collection pauses while shuffling a Dataset:
val operations = spark.read.json(inPath).as[MyClass]
operations.repartition(partitions, operations("id")).write.parquet("s3a://foo")
Are there any obvious configuration tweaks that would address this? My configuration is as follows:
spark.driver.maxResultSize 6G
spark.driver.memory 10G
spark.executor.extraJavaOptions -XX:+UseG1GC -XX:MaxPermSize=1G -XX:+HeapDumpOnOutOfMemoryError
spark.executor.memory 32G
spark.hadoop.fs.s3a.buffer.dir /raid0/spark
spark.hadoop.fs.s3n.buffer.dir /raid0/spark
spark.hadoop.fs.s3n.multipart.uploads.enabled true
spark.hadoop.parquet.block.size 2147483648
spark.hadoop.parquet.enable.summary-metadata false
spark.local.dir /raid0/spark
spark.memory.fraction 0.8
spark.mesos.coarse true
spark.mesos.constraints priority:1
spark.mesos.executor.memoryOverhead 16000
spark.network.timeout 600
spark.rpc.message.maxSize 1000
spark.speculation false
spark.sql.parquet.mergeSchema false
spark.sql.planner.externalSort true
spark.submit.deployMode client
spark.task.cpus 1
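For reference, the same properties can also be set programmatically when building the SparkSession rather than through spark-defaults.conf or spark-submit. This is only a minimal sketch: the app name and the subset of properties shown are illustrative, not part of the original setup.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: mirrors a few of the properties listed above.
// The app name is made up; any property set here must be applied
// before the SparkSession/SparkContext is created.
val spark = SparkSession.builder()
  .appName("json-shuffle")
  .config("spark.executor.memory", "32g")
  .config("spark.executor.extraJavaOptions",
    "-XX:+UseG1GC -XX:MaxPermSize=1G -XX:+HeapDumpOnOutOfMemoryError")
  .config("spark.memory.fraction", "0.8")
  .config("spark.sql.parquet.mergeSchema", "false")
  .config("spark.local.dir", "/raid0/spark")
  .getOrCreate()
```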
Adding the following flags got rid of the GC pauses:
spark.executor.extraJavaOptions -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12
It still took a fair amount of tuning to get there. This Databricks post was very helpful.
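As a rough illustration only (not the asker's exact submission), the tuned executor options could be wired into the same job as below; the placeholder schema, input path, and partition count are assumptions. `-XX:InitiatingHeapOccupancyPercent` lowers the heap-occupancy threshold at which G1 begins a concurrent marking cycle (the default is 45), and `-XX:ConcGCThreads` increases the number of threads used for that concurrent work.

```scala
import org.apache.spark.sql.SparkSession

// Placeholder schema and inputs standing in for the question's
// MyClass, inPath, and partitions (assumptions, not from the post).
case class MyClass(id: String)
val inPath = "s3a://bucket/input/"
val partitions = 4096

// Apply the tuned G1 options before the session is created,
// then run the same read/repartition/write pipeline.
val tunedSpark = SparkSession.builder()
  .appName("json-shuffle-tuned")
  .config("spark.executor.extraJavaOptions",
    "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12")
  .getOrCreate()

import tunedSpark.implicits._

val operations = tunedSpark.read.json(inPath).as[MyClass]
operations.repartition(partitions, operations("id")).write.parquet("s3a://foo")
```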