What is Spark spill (disk and memory)?

fig*_*uts 4 apache-spark apache-spark-sql pyspark spark-ui spark-shuffle

According to the documentation:

Shuffle spill (memory) is the size of the deserialized form of the shuffled data in memory.

Shuffle spill (disk) is the size of the serialized form of the data on disk.

My understanding of shuffle is as follows:

  1. Each executor takes all of the partitions it holds and hash-partitions them into 200 new partitions (the 200 can be changed). Each new partition is associated with the executor it will later go to. For example, for each existing partition: new_partition = hash(partitioning_id) % 200; target_executor = new_partition % num_executors, where % is the modulo operator and num_executors is the number of executors on the cluster.
  2. These new partitions are dumped to disk on the node of their original executor. Each new partition is later read by its target_executor.
  3. The target executors pick up their respective new partitions (out of the 200 generated).
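The routing rule in step 1 can be sketched in plain Python. This is only an illustration of the question's formula, not Spark's actual implementation; `partitioning_id` and `num_executors` are the names used in the question, and 200 matches the default of `spark.sql.shuffle.partitions`.

```python
NUM_SHUFFLE_PARTITIONS = 200  # default of spark.sql.shuffle.partitions

def route(partitioning_id, num_executors):
    """Map a record's key to a shuffle partition and a target executor,
    following the formula from the question."""
    new_partition = hash(partitioning_id) % NUM_SHUFFLE_PARTITIONS
    target_executor = new_partition % num_executors
    return new_partition, target_executor

# Records with the same key always land in the same new partition,
# and therefore on the same target executor.
p1, e1 = route("user_42", num_executors=4)
p2, e2 = route("user_42", num_executors=4)
assert (p1, e1) == (p2, e2)
```

The key property this sketch shows is determinism: every executor applies the same function, so all records sharing a key end up co-located after the shuffle.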

Is my understanding of the shuffle operation correct?

Could you help me place the definitions of shuffle spill (memory) and shuffle spill (disk) in the context of the shuffle mechanism (as described above, if it is correct)? For example (perhaps): "shuffle spill (disk) is the part that happens in point 2 above, where the 200 partitions are dumped to the disks of their respective nodes" (I don't know whether that is correct; it is just an example).

M_S*_*M_S 7

Let's take a look at the documentation, where we can find this:

Shuffle read: Total shuffle bytes and records read, includes both data read locally and data read from remote executors

This is what your executor loads into memory when stage processing starts. You can think of it as the shuffle files prepared in the previous stage by other executors.

Shuffle write: Bytes and records written to disk in order to be read by a shuffle in a future stage

This is the size of the output of your stage, which may be picked up by the next stage for processing; in other words, it is the size of the shuffle files this stage created.

And now, what is shuffle spill?

Shuffle spill (memory) is the size of the deserialized form of the shuffled data in memory.
Shuffle spill (disk) is the size of the serialized form of the data on disk.

Shuffle spill happens when your executor is reading shuffle files but they cannot fit into the execution memory of that executor. When this happens, some chunk of data is removed from memory and written to disk (in other words, it is spilled to disk).
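A toy model of that mechanism, in plain Python rather than Spark's actual code: records accumulate in an in-memory buffer, and once the buffer exceeds a (hypothetical) memory budget, its contents are serialized to a disk file and the memory is freed.

```python
import os
import pickle
import tempfile

class Spiller:
    """Toy spill mechanism: buffer records in memory, write the buffer
    to disk whenever it exceeds the memory budget."""

    def __init__(self, memory_budget_bytes):
        self.budget = memory_budget_bytes
        self.buffer = []          # deserialized records held in memory
        self.buffer_bytes = 0
        self.spill_files = []     # paths of serialized spill files

    def add(self, record, record_bytes):
        self.buffer.append(record)
        self.buffer_bytes += record_bytes
        if self.buffer_bytes > self.budget:
            self._spill()

    def _spill(self):
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, "wb") as f:
            pickle.dump(self.buffer, f)  # serialized form goes to disk
        self.spill_files.append(path)
        self.buffer = []                 # deserialized form leaves memory
        self.buffer_bytes = 0

spiller = Spiller(memory_budget_bytes=100)
for i in range(50):
    spiller.add(("key", i), record_bytes=10)  # 50 records of 10 "bytes"
print(len(spiller.spill_files))  # → 4 (spills at records 11, 22, 33, 44)
```

Spark's real spilling is considerably more involved (sorting, merging, unified memory management), but the shape is the same: memory pressure triggers a serialize-and-write, and the task later reads the spill files back.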

Moving back to your question: what is the difference between spill (memory) and spill (disk)? They describe exactly the same chunk of data. The first metric describes the space occupied by the spilled data in memory before it was moved to disk; the second describes its size when written to disk. The two metrics may differ because data may be represented differently when written to disk; for example, it may be compressed.
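You can see the same effect outside of Spark with a quick sketch: the same records are larger as deserialized in-memory objects than as serialized (and compressed) bytes, which is why spill (disk) is typically smaller than spill (memory). This uses Python's pickle and zlib purely for illustration; Spark uses its own serializers and compression codecs.

```python
import pickle
import sys
import zlib

# 10,000 small (key, value) records, analogous to shuffled rows
rows = [("user_%d" % (i % 10), i) for i in range(10_000)]

# rough footprint of the deserialized objects in memory
in_memory = sys.getsizeof(rows) + sum(sys.getsizeof(r) for r in rows)

serialized = pickle.dumps(rows)          # serialized form
compressed = zlib.compress(serialized)   # serialized + compressed form

# the on-disk (serialized, compressed) form is much smaller than
# the in-memory (deserialized) form
assert len(compressed) < len(serialized) < in_memory
```

Note that `sys.getsizeof` undercounts the true in-memory footprint (it ignores the contents the tuples point to), yet the serialized form still comes out smaller, and compression shrinks it further.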

If you want to read more:

Cloudera questions

"Shuffle spill (memory) is the size of the deserialized form of the data in memory at the time when we spill it, whereas shuffle spill (disk) is the size of the serialized form of the data on disk after we spill it. This is why the latter tends to be much smaller than the former. Note that both metrics are aggregated over the entire duration of the task (i.e. within each task you can spill multiple times)."

Medium 1 Medium 2

Spill is represented by two values (these two values are always presented together):

Spill (Memory): is the size of the data as it exists in memory before it is spilled.

Spill (Disk): is the size of the spilled data after it has been serialized, written to disk, and compressed.