Spark dataset larger than total RAM size

sal*_*nbw 2 hadoop hdfs apache-spark

I have recently been working with Spark jobs and ran into a few questions that I still cannot resolve.

Suppose I have a 100 GB dataset and my cluster's memory size is 16 GB.

Now, I know that when simply reading a file and saving it to HDFS, Spark does this partition by partition. But what happens when I run a sort or aggregation transformation on the 100 GB of data? Since sorting requires all the data, how will Spark handle 100 GB with only 16 GB of memory?

I have gone through the link below, but it only tells us what Spark does when persisting; what I am looking for is how Spark aggregates or sorts a dataset larger than RAM.

Spark RDD - is the partition always in RAM?

Any help is appreciated.

dbu*_*osp 5

There are two things you may want to know.

  1. Once Spark reaches its memory limit, it will start spilling data to disk. Please check this Spark FAQ, and there are also several questions on SO about it (for example, this one).
  2. There is an algorithm called external sort that allows you to sort datasets which do not fit in memory. Essentially, you divide the large dataset into chunks that do fit in memory, sort each chunk, and write it to disk. Finally, you merge the sorted chunks to obtain the whole dataset in sorted order (see the sketch after this list). Spark supports external sorting, as you can see here, and here is the implementation.
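To make the idea concrete, here is a minimal external-sort sketch in plain Scala (not Spark's actual code). It assumes a line-oriented text input; the file names and the `linesPerChunk` parameter are made up for illustration.

```scala
import java.io.{File, PrintWriter}
import scala.io.Source

object ExternalSortSketch {

  // Step 1 and 2: read one in-memory chunk at a time, sort it, spill it to a temp file.
  def sortChunks(input: String, linesPerChunk: Int): Seq[File] = {
    Source.fromFile(input).getLines()
      .grouped(linesPerChunk)                     // one chunk that fits in memory
      .map { chunk =>
        val spill = File.createTempFile("chunk-", ".txt")
        val out = new PrintWriter(spill)
        chunk.sorted.foreach(out.println)         // sort in memory, then write to disk
        out.close()
        spill
      }.toSeq
  }

  // Step 3: merge the sorted spill files into one fully sorted output.
  def mergeChunks(chunks: Seq[File], output: String): Unit = {
    val out = new PrintWriter(output)
    var live = chunks.map(f => Source.fromFile(f).getLines().buffered).filter(_.hasNext)
    while (live.nonEmpty) {
      val smallest = live.minBy(_.head)           // pick the globally smallest head element
      out.println(smallest.next())
      live = live.filter(_.hasNext)
    }
    out.close()
  }

  def main(args: Array[String]): Unit = {
    val spills = sortChunks("big-input.txt", linesPerChunk = 1000000)
    mergeChunks(spills, "sorted-output.txt")
  }
}
```

At no point does the whole dataset sit in memory: only one chunk during the sort phase, and only one line per spill file during the merge phase.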

Answering your question: as explained above, your data does not really need to fit in memory in order to sort it. Now, I would encourage you to think about an algorithm for data aggregation that divides the data into chunks, just like external sort does; a sketch of that idea follows below.
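Here is a minimal sketch of chunked aggregation in plain Scala, in the spirit of what Spark does when an aggregation spills. It assumes a hypothetical input with one "key,value" pair per line, and a made-up `maxKeysInMemory` threshold; it also assumes the set of distinct keys fits in memory for the final merge (otherwise the spills would be sorted by key and merged like the external sort above).

```scala
import java.io.{File, PrintWriter}
import scala.collection.mutable
import scala.io.Source

object ChunkedAggregationSketch {

  def aggregate(input: String, maxKeysInMemory: Int): Map[String, Long] = {
    val partial = mutable.Map.empty[String, Long]   // bounded in-memory partial sums
    val spills = mutable.Buffer.empty[File]

    def spill(): Unit = {
      val f = File.createTempFile("agg-", ".txt")
      val out = new PrintWriter(f)
      partial.foreach { case (k, v) => out.println(s"$k,$v") }
      out.close()
      partial.clear()
      spills += f
    }

    for (line <- Source.fromFile(input).getLines()) {
      val Array(key, value) = line.split(",", 2)
      partial(key) = partial.getOrElse(key, 0L) + value.toLong
      if (partial.size >= maxKeysInMemory) spill()  // memory limit reached: spill partial sums
    }
    spill()                                         // spill whatever is left

    // Merge the spilled partial aggregates key by key.
    val merged = mutable.Map.empty[String, Long]
    for (f <- spills; line <- Source.fromFile(f).getLines()) {
      val Array(key, value) = line.split(",", 2)
      merged(key) = merged.getOrElse(key, 0L) + value.toLong
    }
    merged.toMap
  }

  def main(args: Array[String]): Unit =
    aggregate("pairs.txt", maxKeysInMemory = 1000000).take(10).foreach(println)
}
```

The key point is the same as for sorting: only a bounded amount of data is held in memory at any time, and partial results on disk are combined at the end.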