fig*_*uts 4 apache-spark apache-spark-sql pyspark spark-ui spark-shuffle
根据文档:
洗牌溢出(内存)是内存中洗牌数据的反序列化形式的大小。
Shuffle 溢出(磁盘)是磁盘上数据的序列化形式的大小。
我对shuffle的理解是这样的:
For each existing partition: new_partition = hash(partitioning_id)%200; target_executor = new_partition%num_executors其中%是模运算符,num_executors 是集群上执行程序的数量。我对shuffle操作的理解是否正确?
您能帮我将 shuffle 溢出(内存)和 shuffle 溢出(磁盘)的定义放在 shuffle 机制的上下文中(如果正确的话,上面描述的)?例如(也许):“shuffle溢出(磁盘)是上面提到的第2点中发生的部分,其中200个分区被转储到各自节点的磁盘上”(我不知道这样说是否正确;只是举个例子)
Lets take a look at docu where we can find this:
Shuffle read: Total shuffle bytes and records read, includes both data read locally and data read from remote executors
Run Code Online (Sandbox Code Playgroud)
This is what your executor loads into memory when stage processing is starting, you can think about this as shuffle files prepared in previous stage by other executors
Shuffle write: Bytes and records written to disk in order to be read by a shuffle in a future stage
Run Code Online (Sandbox Code Playgroud)
This is size of output of your stage which may be picked up by next stage for processing, in other words this is a size of shuffle files which this stage created
And now what is shuffle spill
Shuffle spill (memory) is the size of the deserialized form of the shuffled data in memory.
Shuffle spill (disk) is the size of the serialized form of the data on disk.
Run Code Online (Sandbox Code Playgroud)
Shuffle spill hapens when your executor is reading shuffle files but they cannot fit into execution memory of this executor. When this happens, some chunk of data is removed from memory and written to disc (its spilled to disc in other words)
Moving back to your question: what is the difference between spill(memory) and spill(disc)? Its describing excatly the same chunk of data. First metric is describing space occupied by those spilled data in memory before they were moved to disc, second is describing their size when written to disc. Those two metrics may be different because data may be represented differently when written to disc, for example they may be compressed.
If you want to read more:
"Shuffle spill (memory) is the size of the deserialized form of the data in memory at the time when we spill it, whereas shuffle spill (disk) is the size of the serialized form of the data on disk after we spill it. This is why the latter tends to be much smaller than the former. Note that both metrics are aggregated over the entire duration of the task (i.e. within each task you can spill multiple times)."
Spill is represented by two values: (These two values are always presented together.)
Spill (Memory): is the size of the data as it exists in memory before it is spilled.
Spill (Disk): is size of the data that gets spilled, serialized and, written into disk and gets compressed.
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
9834 次 |
| 最近记录: |