Mar*_*rko 5 apache-spark pyspark
我是 Spark 的新手,我似乎遇到了一些性能问题。我正在尝试计算 DataFrame 中不同参数之间的简单计算(我在 Spark 1.5.2 上使用 PySpark 来执行此操作),但问题是与实际计算相比,我的任务反序列化时间非常长。
以下是计算两对不同参数之间的计算时的屏幕截图。
为了计算相关性,我只是使用full_dataframe.stat.corr('param1', 'param2')
. 数据集已先缓存,然后执行此计算。我实际上是在尝试计算所有参数之间的相关性并生成相关性图,因此我在循环中调用此行,在其中迭代不同的参数组合。缓存数据集大小为 5.2GB。
我在一台 4 集群机器 (YARN) 上运行这项工作,其中每台机器都有:
我正在通过 Jupyter 使用 PySpark,并且我已经开始使用它:
pyspark --master yarn --driver-memory 2560m --num-executors 4 --executor-cores 4 --executor-memory 5G --conf spark.yarn.executor.memoryOverhead=2048
我尝试过使用不同数量的分区df.repartition(no_of_partitions)
,例如 16、32、128、256,但没有任何帮助。
此外,一段时间后,我的工作完全中断,我从 ui 中收到以下错误:
HTTP ERROR 500
Problem accessing /proxy/application_1485432889177_0016/stages/stage. Reason:
Connection to http://192.168.84.27:4040 refused
Caused by:
org.apache.http.conn.HttpHostConnectException: Connection to http://192.168.84.27:4040 refused
Run Code Online (Sandbox Code Playgroud)
当我查看 Jupyter 的输出时,我看到了以下异常:
17/01/29 17:06:06 ERROR Utils: Uncaught exception in thread task-result-getter-14
java.lang.OutOfMemoryError: GC overhead limit exceeded
at sun.reflect.ByteVectorImpl.trim(ByteVectorImpl.java:70)
at sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:386)
at sun.reflect.MethodAccessorGenerator.generateSerializationConstructor(MethodAccessorGenerator.java:112)
at sun.reflect.ReflectionFactory.newConstructorForSerialization(ReflectionFactory.java:340)
at java.io.ObjectStreamClass.getSerializableConstructor(ObjectStreamClass.java:1376)
at java.io.ObjectStreamClass.access$1500(ObjectStreamClass.java:72)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:493)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:468)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:602)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1623)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1623)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:501)
Exception in thread "task-result-getter-13" 17/01/29 17:06:06 ERROR Utils: Uncaught exception in thread task-result-getter-15
java.lang.OutOfMemoryError: GC overhead limit exceeded
at sun.reflect.ByteVectorImpl.trim(ByteVectorImpl.java:70)
at sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:386)
at sun.reflect.MethodAccessorGenerator.generateSerializationConstructor(MethodAccessorGenerator.java:112)
at sun.reflect.ReflectionFactory.newConstructorForSerialization(ReflectionFactory.java:340)
at java.io.ObjectStreamClass.getSerializableConstructor(ObjectStreamClass.java:1376)
at java.io.ObjectStreamClass.access$1500(ObjectStreamClass.java:72)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:493)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:468)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:602)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1623)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1623)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:501)
java.lang.OutOfMemoryError: GC overhead limit exceeded
at sun.reflect.ByteVectorImpl.trim(ByteVectorImpl.java:70)
at sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:386)
at sun.reflect.MethodAccessorGenerator.generateSerializationConstructor(MethodAccessorGenerator.java:112)
at sun.reflect.ReflectionFactory.newConstructorForSerialization(ReflectionFactory.java:340)
at java.io.ObjectStreamClass.getSerializableConstructor(ObjectStreamClass.java:1376)
at java.io.ObjectStreamClass.access$1500(ObjectStreamClass.java:72)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:493)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:468)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:602)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1623)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1623)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:501)
Exception in thread "task-result-getter-12" Exception in thread "task-result-getter-14" java.lang.OutOfMemoryError: GC overhead limit exceeded
at sun.reflect.ByteVectorImpl.trim(ByteVectorImpl.java:70)
at sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:386)
at sun.reflect.MethodAccessorGenerator.generateSerializationConstructor(MethodAccessorGenerator.java:112)
at sun.reflect.ReflectionFactory.newConstructorForSerialization(ReflectionFactory.java:340)
at java.io.ObjectStreamClass.getSerializableConstructor(ObjectStreamClass.java:1376)
at java.io.ObjectStreamClass.access$1500(ObjectStreamClass.java:72)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:493)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:468)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
Run Code Online (Sandbox Code Playgroud)
这种行为的原因可能是什么以及如何解决?
更改最小和最大拆分大小后,我在驱动程序中遇到了类似的 GC 错误:
17/01/29 21:37:43 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-18] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: GC overhead limit exceeded
at sun.reflect.ByteVectorImpl.resize(ByteVectorImpl.java:84)
at sun.reflect.ByteVectorImpl.add(ByteVectorImpl.java:63)
at sun.reflect.ClassFileAssembler.emitByte(ClassFileAssembler.java:74)
at sun.reflect.ClassFileAssembler.emitShort(ClassFileAssembler.java:63)
at sun.reflect.ClassFileAssembler.emitConstantPoolNameAndType(ClassFileAssembler.java:120)
at sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:313)
at sun.reflect.MethodAccessorGenerator.generateSerializationConstructor(MethodAccessorGenerator.java:112)
at sun.reflect.ReflectionFactory.newConstructorForSerialization(ReflectionFactory.java:340)
at java.io.ObjectStreamClass.getSerializableConstructor(ObjectStreamClass.java:1376)
at java.io.ObjectStreamClass.access$1500(ObjectStreamClass.java:72)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:493)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:468)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1134)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$writeObject$1.apply$mcV$sp(TorrentBroadcast.scala:162)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1160)
at org.apache.spark.broadcast.TorrentBroadcast.writeObject(TorrentBroadcast.scala:160)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
17/01/29 21:37:46 WARN YarnHistoryService: Discarding event
17/01/29 21:37:50 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-16] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: GC overhead limit exceeded
at sun.reflect.ByteVectorImpl.resize(ByteVectorImpl.java:84)
at sun.reflect.ByteVectorImpl.add(ByteVectorImpl.java:63)
at sun.reflect.ClassFileAssembler.emitByte(ClassFileAssembler.java:74)
at sun.reflect.ClassFileAssembler.emitShort(ClassFileAssembler.java:63)
at sun.reflect.ClassFileAssembler.emitConstantPoolNameAndType(ClassFileAssembler.java:120)
at sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:313)
at sun.reflect.MethodAccessorGenerator.generateSerializationConstructor(MethodAccessorGenerator.java:112)
at sun.reflect.ReflectionFactory.newConstructorForSerialization(ReflectionFactory.java:340)
at java.io.ObjectStreamClass.getSerializableConstructor(ObjectStreamClass.java:1376)
at java.io.ObjectStreamClass.access$1500(ObjectStreamClass.java:72)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:493)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:468)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1134)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$writeObject$1.apply$mcV$sp(TorrentBroadcast.scala:162)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1160)
at org.apache.spark.broadcast.TorrentBroadcast.writeObject(TorrentBroadcast.scala:160)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
17/01/29 21:37:50 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-17] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: GC overhead limit exceeded
at sun.reflect.ByteVectorImpl.resize(ByteVectorImpl.java:84)
at sun.reflect.ByteVectorImpl.add(ByteVectorImpl.java:63)
at sun.reflect.ClassFileAssembler.emitByte(ClassFileAssembler.java:74)
at sun.reflect.ClassFileAssembler.emitShort(ClassFileAssembler.java:63)
at sun.reflect.ClassFileAssembler.emitConstantPoolNameAndType(ClassFileAssembler.java:120)
at sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:313)
at sun.reflect.MethodAccessorGenerator.generateSerializationConstructor(MethodAccessorGenerator.java:112)
at sun.reflect.ReflectionFactory.newConstructorForSerialization(ReflectionFactory.java:340)
at java.io.ObjectStreamClass.getSerializableConstructor(ObjectStreamClass.java:1376)
at java.io.ObjectStreamClass.access$1500(ObjectStreamClass.java:72)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:493)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:468)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1134)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$writeObject$1.apply$mcV$sp(TorrentBroadcast.scala:162)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1160)
at org.apache.spark.broadcast.TorrentBroadcast.writeObject(TorrentBroadcast.scala:160)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
17/01/29 21:37:50 WARN YarnHistoryService: Discarding event
17/01/29 21:37:50 WARN YarnHistoryService: Discarding event
17/01/29 21:37:50 WARN YarnHistoryService: Discarding event
17/01/29 21:37:50 WARN YarnHistoryService: Discarding event
17/01/29 21:37:50 WARN YarnHistoryService: Discarding event
17/01/29 21:37:50 WARN YarnHistoryService: Discarding event
17/01/29 21:37:50 WARN YarnHistoryService: Discarding event
17/01/29 21:37:50 WARN YarnHist
归档时间: |
|
查看次数: |
653 次 |
最近记录: |