How will Spark load a huge CSV file if the entire file is present on a single node?

Question

Suppose I have a huge 50 GB CSV file on a single HDFS node, and I am trying to read it with spark.read as follows:

file_df = spark.read.format('csv').option('header', 'true').option('inferSchema', 'true').load('/hdfspath/customer.csv')

I am submitting the Spark job with the following spark-submit command:

spark-submit --master yarn --deploy-mode cluster --num-executors 4 --executor-memory 3G --executor-cores 5 --driver-memory 1G load_csv.py

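I understand that Spark does not load any data into memory until an action is triggered. But what happens when an action is triggered, given that the first step is to read the file into memory to start the transformations? How does Spark read the 50 GB file in small parts based on the cores and executors I specified?

For example: I specified 4 executors with 3 GB of memory each. While reading, will Spark split the main customer.csv file into 3 GB chunks per executor and load the file as follows:

For the first 12 GB: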

Executor 1: 3GB
Executor 2: 3GB
Executor 3: 3GB
Executor 4: 3GB

and so on until the whole file has been processed?

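Or will it split the file based on the HDFS block size and read it block by block, e.g. 128 MB, and try to fit as many blocks as possible into each 3 GB executor?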

How will Spark process the file if it resides entirely on a single node of the cluster (which is my case)?

I understand this is a bit broad and cumbersome to explain, but any help would be much appreciated.

Answer 1

sat*_*hya 5

If I understand correctly,

These are well-known general practices for tuning Spark to process large datasets (50 GB is not a huge dataset either).

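Will it split the file based on the HDFS block size and read it block by block, e.g. 128 MB, and try to fit as many blocks as possible into each 3 GB executor?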

Ans: Yes, 1 partition for 1 HDFS block (128 MB ideally) for splittable file formats. In this case Spark creates partitions based on the HDFS block size, not on executor memory.
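To make the scale concrete, here is a rough back-of-the-envelope sketch in plain Python (no Spark needed). It assumes a 128 MB split size, an uncompressed and evenly splittable file, and the 4 executors x 5 cores from the question; the function name `estimate_read_plan` is made up for illustration, and real partition counts can differ somewhat depending on the Spark version and settings.

```python
def estimate_read_plan(file_size_bytes, split_size_bytes, num_executors, cores_per_executor):
    """Estimate input partitions, concurrent task slots and task waves."""
    # One partition (and therefore one task) per split; ceiling division
    num_partitions = -(-file_size_bytes // split_size_bytes)
    # Each executor core runs one task at a time
    task_slots = num_executors * cores_per_executor
    # Tasks are processed in waves of at most `task_slots` at once
    waves = -(-num_partitions // task_slots)
    return num_partitions, task_slots, waves


GB = 1024 ** 3
MB = 1024 ** 2

partitions, slots, waves = estimate_read_plan(50 * GB, 128 * MB, 4, 5)
print(partitions, slots, waves)  # 400 20 20
```

So under these assumptions the file is read as roughly 400 tasks of about 128 MB each, 20 at a time, rather than as 3 GB chunks per executor.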

2. Storage levels: memory and disk

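With cache() (which for an RDD is the same as persist(StorageLevel.MEMORY_ONLY)), Spark tries to keep all partitions in memory; partitions that do not fit are not cached and are recomputed when needed, and under heavy memory pressure you can still run into OOM. If you call persist(StorageLevel.MEMORY_AND_DISK), it stores as much as it can in memory and puts the rest on disk. If the data does not fit on disk either, the OS will usually kill your workers.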

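Note that Spark has its own small memory management system. Some of the memory you assign to a Spark job is used to hold the data being worked on, and some is used for storage if you call cache or persist.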
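As a rough illustration of that split, the sketch below estimates how a 3 GB executor heap is divided. It assumes Spark's unified memory model with its default settings (about 300 MB reserved, `spark.memory.fraction` = 0.6, `spark.memory.storageFraction` = 0.5); the helper name `unified_memory_mb` is made up for this example, and the heap the JVM actually reports is usually a little smaller than the requested size.

```python
RESERVED_MB = 300  # heap Spark sets aside for its own internals (assumed default)


def unified_memory_mb(executor_heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Approximate the execution/storage split of one executor heap, in MB."""
    # Memory shared by execution (shuffles, joins, sorts) and storage (cache/persist)
    usable = (executor_heap_mb - RESERVED_MB) * memory_fraction
    # Part of the shared region that cached blocks are protected in
    storage = usable * storage_fraction
    execution = usable - storage
    return usable, storage, execution


usable, storage, execution = unified_memory_mb(3 * 1024)
print(round(usable), round(storage), round(execution))  # 1663 832 832
```

The boundary between the two regions is soft: execution can borrow unused storage memory and vice versa, so these numbers are a starting point rather than hard limits.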

from pyspark.storagelevel import StorageLevel

file_df = (
    spark.read.format('csv')
    .option('header', 'true')
    .option('inferSchema', 'true')
    .load('/hdfspath/customer.csv')
)

# Keep as many partitions as fit in memory and spill the rest to disk
file_df = file_df.persist(StorageLevel.MEMORY_AND_DISK)
# df2 = file_df.persist(StorageLevel.DISK_ONLY)

Storage Level    Space used  CPU time  In memory  On-disk  Serialized   Recompute some partitions
----------------------------------------------------------------------------------------------------
MEMORY_ONLY          High        Low       Y          N        N         Y    
MEMORY_ONLY_SER      Low         High      Y          N        Y         Y
MEMORY_AND_DISK      High        Medium    Some       Some     Some      N
MEMORY_AND_DISK_SER  Low         High      Some       Some     Y         N
DISK_ONLY            Low         High      N          Y        Y         N

3. Try the following memory settings.

spark-submit --master yarn --deploy-mode cluster --num-executors 4 --executor-memory 3G --executor-cores 5 --driver-memory 3G load_csv.py

Suppose you have a 10-node cluster with the following configuration:

**Cluster Config:**
10 Nodes
16 cores per Node
64GB RAM per Node

3.1 First approach: Tiny executors [one executor per core]:

- `--num-executors` = `In this approach, we'll assign one executor per core`
                    = `total-cores-in-cluster`
                   = `num-cores-per-node * total-nodes-in-cluster` 
                   = 16 x 10 = 160
- `--executor-cores` = 1 (one executor per core)
- `--executor-memory` = `amount of memory per executor`
                     = `mem-per-node/num-executors-per-node`
                     = 64GB/16 = 4GB

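With only one core per executor, we cannot take advantage of running multiple tasks in the same JVM. Also, shared/cached variables such as broadcast variables and accumulators will be replicated once per core on each node, i.e. 16 times. On top of that, we are not leaving enough memory overhead for the Hadoop/YARN daemon processes, and we are not accounting for the ApplicationMaster. Not good!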
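The replication cost is easy to quantify. The sketch below assumes each executor JVM keeps its own copy of a broadcast variable, and uses a hypothetical 100 MB broadcast purely for illustration; `broadcast_footprint_mb` is not a Spark API.

```python
def broadcast_footprint_mb(broadcast_mb, executors_per_node):
    """Memory used per node by copies of one broadcast variable, in MB."""
    # One copy lives in every executor JVM on the node
    return broadcast_mb * executors_per_node


tiny = broadcast_footprint_mb(100, 16)   # 16 one-core executors per node
single = broadcast_footprint_mb(100, 1)  # one executor per node
print(tiny, single)  # 1600 100
```

With tiny executors the same 100 MB of data costs 1.6 GB per node, which comes straight out of the 4 GB available to each executor's neighbours.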

3.2 Second approach: Fat executors (one executor per node):

- `--num-executors` = `In this approach, we'll assign one executor per node`
                    = `total-nodes-in-cluster`
                   = 10
- `--executor-cores` = `one executor per node means all the cores of the node are assigned to one executor`
                     = `total-cores-in-a-node`
                     = 16
- `--executor-memory` = `amount of memory per executor`
                     = `mem-per-node/num-executors-per-node`
                     = 64GB/1 = 64GB

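With all 16 cores in one executor, apart from the ApplicationMaster and daemon processes not being accounted for, HDFS throughput will suffer and garbage collection becomes excessive. Also not good!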

3.3 Third approach: a balance between fat and tiny

Based on the recommendations above:

**1. Cores**
Let's assign 5 cores per executor => `--executor-cores = 5 (for good HDFS throughput)`
Leave 1 core per node for Hadoop/Yarn daemons => `Num cores available per node = 16-1 = 15`
So, total cores available in the cluster = 15 x 10 = 150

**2. Executors**
Number of available executors = `(total cores/num-cores-per-executor) = 150/5 = 30`
Leaving 1 executor for the ApplicationMaster => --num-executors = 29
Number of executors per node = 30/10 = 3
Memory per executor = 64GB/3 = 21GB
Counting off heap overhead: 7% of 21GB is about 1.5GB; budgeting a more conservative 3GB, actual --executor-memory = 21 - 3 = 18GB

So the recommended configuration for the 10-node cluster above is: 29 executors, with 18GB of memory and 5 cores each.
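The arithmetic for the balanced approach can be sketched as a small helper. The function name `balanced_executor_config` and its parameters are made up for illustration, and the 3 GB off-heap overhead is simply the figure used in this answer rather than something Spark computes this way; treat it as a starting point for your own cluster, not a definitive rule.

```python
def balanced_executor_config(nodes, cores_per_node, mem_per_node_gb,
                             cores_per_executor=5, overhead_gb=3):
    """Derive spark-submit sizing flags using the balanced approach."""
    # Leave 1 core per node for the Hadoop/YARN daemons
    usable_cores_per_node = cores_per_node - 1
    executors_per_node = usable_cores_per_node // cores_per_executor
    # Leave 1 executor slot in the cluster for the ApplicationMaster
    num_executors = executors_per_node * nodes - 1
    # Split node memory between its executors, then subtract off-heap overhead
    mem_per_executor_gb = mem_per_node_gb // executors_per_node
    executor_memory_gb = mem_per_executor_gb - overhead_gb
    return {
        "num_executors": num_executors,
        "executor_cores": cores_per_executor,
        "executor_memory_gb": executor_memory_gb,
    }


cfg = balanced_executor_config(nodes=10, cores_per_node=16, mem_per_node_gb=64)
print(f"--num-executors {cfg['num_executors']} "
      f"--executor-cores {cfg['executor_cores']} "
      f"--executor-memory {cfg['executor_memory_gb']}G")
# --num-executors 29 --executor-cores 5 --executor-memory 18G
```

Plugging in a different node count or memory size shows how quickly the numbers shift, which is why it helps to redo the calculation per cluster instead of copying flags.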

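The three parameters --num-executors, --executor-cores and --executor-memory play a very important role in Spark performance, because they control how much CPU and memory your Spark application gets. That makes it crucial for users to understand the right way to configure them.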
