Tags: serialization, kryo, apache-spark, pyspark
I am using Spark 1.6.1 with Python. How can I enable Kryo serialization when working with PySpark?
I have the following settings in my spark-defaults.conf file:
spark.eventLog.enabled true
spark.eventLog.dir //local_drive/sparkLogs
spark.default.parallelism 8
spark.locality.wait.node 5s
spark.executor.extraJavaOptions -XX:+UseCompressedOops
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryo.classesToRegister Timing, Join, Select, Predicate, Timeliness, Project, Query2, ScanSelect
spark.shuffle.compress true
and I get the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o35.load.
: org.apache.spark.SparkException: Failed to register classes with Kryo
at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:128)
at org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:273)
at org.apache.spark.serializer.KryoSerializerInstance.<init>(KryoSerializer.scala:258)
at org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:174)
Caused by: java.lang.ClassNotFoundException: Timing
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:274)
at org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$4.apply(KryoSerializer.scala:120)
at org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$4.apply(KryoSerializer.scala:120)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:120)
The main script (Query2.py) contains:
from Timing import Timing
from Predicate import Predicate
from Join import Join
from ScanSelect import ScanSelect
from Select import Select
from Timeliness import Timeliness
from Project import Project
import sys
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().setMaster(master).setAppName(sys.argv[1]).setSparkHome("$SPARK_HOME")
# Configuration must be set before the SparkContext is created;
# calling conf.set() afterwards has no effect on the running context.
conf.set("spark.kryo.registrationRequired", "true")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
I know that "Kryo won't make a major impact on PySpark because it just stores data as byte[] objects, which are fast to serialize even with Java, but it may be worth a try to set spark.serializer and not try to register any classes" (Matei Zaharia, 2014). However, I need to register these classes.

Thanks in advance.
This is not possible. Kryo is a Java (JVM) serialization framework. It cannot be used with Python classes. To serialize Python objects, PySpark uses Python serialization tools: the standard pickle module and an improved version of cloudpickle. You can find additional information about PySpark serialization in "Tips for properly using large broadcast variables?".
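To illustrate what PySpark actually does on the Python side, here is a minimal sketch of the pickle round trip that its serializers are built on. The `Timing` class below is a hypothetical stand-in for the asker's Python class, not code from the question:

```python
import pickle


class Timing:
    # Hypothetical stand-in for the asker's Python class "Timing".
    def __init__(self, label, seconds):
        self.label = label
        self.seconds = seconds


# PySpark ships Python objects between the driver and workers as pickled
# bytes, so a Python class needs no Kryo registration (and the JVM could
# not load it anyway, hence the ClassNotFoundException above).
blob = pickle.dumps(Timing("scan", 1.25))
restored = pickle.loads(blob)
print(restored.label, restored.seconds)
```

Kryo never sees these bytes as structured objects; from the JVM's point of view they are opaque `byte[]` payloads.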
You can still enable Kryo serialization when using PySpark, but it will not affect the way Python objects are serialized. It is used only for serialization of Java or Scala objects.
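Concretely, a spark-defaults.conf for this setup would keep the Kryo serializer but drop the registration of Python class names; a sketch based on the asker's original settings:

```
spark.serializer        org.apache.spark.serializer.KryoSerializer
spark.shuffle.compress  true
# spark.kryo.classesToRegister removed: Timing, Join, Select, etc. are
# Python classes and cannot be loaded by the JVM class loader, which is
# exactly what causes the ClassNotFoundException in the question.
```

This keeps the Kryo benefit for the JVM-internal objects Spark serializes (shuffle metadata, broadcast wrappers) while leaving the Python objects to pickle/cloudpickle.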
Viewed: 3237 times