cap*_*cin 8 | tags: python, pickle, apache-spark, rdd, pyspark
I am working with a broadcast variable that is roughly 100 MB when pickled, which I approximate with:
>>> data = list(range(int(10*1e6)))
>>> import cPickle as pickle
>>> len(pickle.dumps(data))
98888896
I am running on a cluster with 3 c3.2xlarge executors and an m3.large driver, launching an interactive session with the following command:
IPYTHON=1 pyspark --executor-memory 10G --driver-memory 5G --conf spark.driver.maxResultSize=5g
If I persist references to this broadcast variable in an RDD, memory usage explodes. For 100 references to a ~100 MB variable, even if it were copied 100 times, I would expect total data usage of no more than 10 GB (let alone 30 GB across 3 nodes). However, when I run the following test, I see out-of-memory errors:
data = list(range(int(10*1e6)))
metadata = sc.broadcast(data)
ids = sc.parallelize(zip(range(100), range(100)))
joined_rdd = ids.mapValues(lambda _: metadata.value)
joined_rdd.persist()
print('count: {}'.format(joined_rdd.count()))
Stack trace:
TaskSetManager: Lost task 17.3 in stage 0.0 (TID 75, 10.22.10.13):
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
process()
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/usr/lib/spark/python/pyspark/rdd.py", line 2355, in pipeline_func
return func(split, prev_func(split, iterator))
File "/usr/lib/spark/python/pyspark/rdd.py", line 2355, in pipeline_func
return func(split, prev_func(split, iterator))
File "/usr/lib/spark/python/pyspark/rdd.py", line 317, in func
return f(iterator)
File "/usr/lib/spark/python/pyspark/rdd.py", line 1006, in <lambda>
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "/usr/lib/spark/python/pyspark/rdd.py", line 1006, in <genexpr>
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 139, in load_stream
yield self._read_with_length(stream)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 422, in loads
return pickle.loads(obj)
MemoryError
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:138)
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:179)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:97)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
16/05/25 23:57:15 ERROR TaskSetManager: Task 17 in stage 0.0 failed 4 times; aborting job
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-1-7a262fdfa561> in <module>()
7 joined_rdd.persist()
8 print('persist called')
----> 9 print('count: {}'.format(joined_rdd.count()))
/usr/lib/spark/python/pyspark/rdd.py in count(self)
1004 3
1005 """
-> 1006 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
1007
1008 def stats(self):
/usr/lib/spark/python/pyspark/rdd.py in sum(self)
995 6.0
996 """
--> 997 return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
998
999 def count(self):
/usr/lib/spark/python/pyspark/rdd.py in fold(self, zeroValue, op)
869 # zeroValue provided to each partition is unique from the one provided
870 # to the final reduce call
--> 871 vals = self.mapPartitions(func).collect()
872 return reduce(op, vals, zeroValue)
873
/usr/lib/spark/python/pyspark/rdd.py in collect(self)
771 """
772 with SCCallSiteSync(self.context) as css:
--> 773 port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
774 return list(_load_from_socket(port, self._jrdd_deserializer))
775
/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
I have seen previous threads about the memory usage of pickle deserialization being a problem. However, I expected a broadcast variable to be deserialized (and loaded into memory on an executor) only once, with subsequent references to .value pointing at that in-memory address. That does not appear to be the case, though. Am I missing something?
The examples I have seen use broadcast variables as dictionaries, applied once to transform a dataset (e.g., replacing airport acronyms with airport names). The motivation for persisting them here is to create objects that encapsulate knowledge of the broadcast variable and how to interact with it, persist those objects, and perform multiple computations with them (relying on Spark to keep them in memory).
What are some tips for using large (100 MB+) broadcast variables? Is persisting a broadcast variable misguided? Is this a problem specific to PySpark?
Thanks! Any help is much appreciated.
Note: I have also posted this question on the Databricks forums.
Edit - follow-up question:
It was suggested that the default Spark serializer has a batch size of 65536. Objects serialized in different batches are not recognized as identical and are assigned different memory addresses, which I check here via the built-in id function. However, even with a larger broadcast variable that would in theory take 256 batches to serialize, I still see only 2 distinct copies. Shouldn't I see more? Is my understanding of how batched serialization works incorrect?
>>> sc.serializer.bestSize
65536
>>> import cPickle as pickle
>>> broadcast_data = {k: v for (k, v) in enumerate(range(int(1e6)))}
>>> len(pickle.dumps(broadcast_data))
16777786
>>> len(pickle.dumps({k: v for (k, v) in enumerate(range(int(1e6)))})) / sc.serializer.bestSize
256
>>> bd = sc.broadcast(broadcast_data)
>>> rdd = sc.parallelize(range(100), 1).map(lambda _: bd.value)
>>> rdd.map(id).distinct().count()
1
>>> rdd.cache().count()
100
>>> rdd.map(id).distinct().count()
2
zer*_*323 11
Well, the devil is in the details. To understand why this can happen, we have to take a closer look at the PySpark serializers. First, let's create a SparkContext with default settings:
from pyspark import SparkContext
sc = SparkContext("local", "foo")
and check what the default serializer is:
sc.serializer
## AutoBatchedSerializer(PickleSerializer())
sc.serializer.bestSize
## 65536
This tells us three things:

- the AutoBatchedSerializer serializer is used
- it delegates the actual work to PickleSerializer
- the bestSize of a serialized batch is 65536 bytes

A quick look at the source code will show you that this serializer adjusts the number of records serialized at a time at runtime, and tries to keep the batch size below 10 * bestSize. The important point is that not all records in a single partition are serialized at the same time.
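To make that concrete, here is a minimal sketch (my illustration, not the actual PySpark implementation) of the adaptive doubling-and-halving logic described above:

from itertools import islice
import pickle

def auto_batched_dumps(records, best_size=65536):
    # Serialize records in growing batches: double the batch while the
    # pickled blob stays under best_size, halve it past 10 * best_size.
    batch = 1
    it = iter(records)
    while True:
        chunk = list(islice(it, batch))
        if not chunk:
            break
        blob = pickle.dumps(chunk)
        yield blob
        if len(blob) < best_size:
            batch *= 2
        elif len(blob) > best_size * 10 and batch > 1:
            batch //= 2

len(list(auto_batched_dumps(range(100000))))
## a few dozen blobs rather than one per record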
We can check that experimentally as follows:
bd = sc.broadcast({})
rdd = sc.parallelize(range(10), 1).map(lambda _: bd.value)
rdd.map(id).distinct().count()
## 1
rdd.cache().count()
## 10
rdd.map(id).distinct().count()
## 2
As you can see, even in this simple example we get two distinct objects after serialization-deserialization. You can observe similar behavior working directly with pickle:
import pickle

v = {}
vs = [v, v, v, v]
v1, *_, v4 = pickle.loads(pickle.dumps(vs))
v1 is v4
## True
(v1_, v2_), (v3_, v4_) = (
pickle.loads(pickle.dumps(vs[:2])),
pickle.loads(pickle.dumps(vs[2:]))
)
v1_ is v4_
## False
v3_ is v4_
## True
After unpickling, values that were serialized in the same batch reference the same object; values from different batches point to distinct objects.
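To make the relationship explicit, here is a small helper (my illustration, not from the original answer) that round-trips n references to a single object in batches of size b and counts the distinct objects that come back:

import pickle

def distinct_after_roundtrip(n, b):
    # Pickle n references to one dict in batches of b; references within
    # a batch collapse to a single object, each batch yields a new copy.
    v = {}
    refs = [v] * n
    restored = []
    for i in range(0, n, b):
        restored.extend(pickle.loads(pickle.dumps(refs[i:i + b])))
    return len({id(x) for x in restored})

distinct_after_roundtrip(10, 10)
## 1
distinct_after_roundtrip(10, 2)
## 5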
In practice, Spark has multiple serializers and different serialization strategies. For example, you can use batches of unlimited size:
from pyspark.serializers import BatchedSerializer, PickleSerializer
rdd_ = (sc.parallelize(range(10), 1).map(lambda _: bd.value)
        ._reserialize(BatchedSerializer(PickleSerializer())))
rdd_.cache().count()
rdd_.map(id).distinct().count()
## 1
You can change the serializer by passing the serializer and/or batchSize parameters to the SparkContext constructor:
sc = SparkContext(
    "local", "bar",
    serializer=PickleSerializer(),  # Default serializer
    # Unlimited batch size -> BatchedSerializer instead of AutoBatchedSerializer
    batchSize=-1
)
sc.serializer
## BatchedSerializer(PickleSerializer(), -1)
Choosing a different serializer or batching strategy comes with different trade-offs (speed, the ability to serialize arbitrary objects, memory requirements, and so on).
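For instance (an illustrative sketch, assuming no other SparkContext is running): MarshalSerializer is typically faster for simple built-in types, but unlike PickleSerializer it cannot handle arbitrary Python objects:

from pyspark import SparkContext
from pyspark.serializers import MarshalSerializer

# marshal only supports built-in types such as ints, strings and lists
sc = SparkContext("local", "baz", serializer=MarshalSerializer())
sc.parallelize(range(10)).map(lambda x: x * 2).sum()
## 90
# records that are not marshal-serializable (e.g. instances of custom
# classes) would fail here, where PickleSerializer would succeed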
You should also keep in mind that broadcast variables in Spark are not shared between executor threads, so multiple deserialized copies can exist on the same worker at the same time.
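One practical consequence (my suggestion, not part of the original answer): if you only need the broadcast value during a computation, dereference it once per partition with mapPartitions instead of baking a per-record reference into a persisted RDD, so no broadcast-sized copies get pinned in the cache. Reusing the names from the question:

def with_metadata(partition):
    # The broadcast value is deserialized at most once per task, and
    # nothing broadcast-sized ends up in the persisted output.
    meta = metadata.value
    for key, _ in partition:
        yield key, len(meta)

joined_rdd = ids.mapPartitions(with_metadata)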
Moreover, you will see similar behavior if you execute a transformation that requires shuffling.
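For example (a sketch reusing bd from above): records that reference bd.value get re-serialized into shuffle blocks, so the reduce side deserializes fresh copies rather than reusing the copies that already exist on the map side:

shuffled = (sc.parallelize(range(10), 2)
            .map(lambda x: (x % 2, bd.value))
            .groupByKey())
shuffled.count()
## 2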