Hadoop MapReduce vs MPI (vs Spark vs Mahout vs Mesos) - When to use one over the other?

Question

GuS*_*uku 17 parallel-processing hadoop mapreduce mpi

I am new to parallel computing and just starting to try out MPI and Hadoop+MapReduce on Amazon AWS. But I am confused about when to use one over the other.

For example, one common rule of thumb I see can be summarized as:

Big data, non-iterative, fault tolerant => MapReduce
Speed, small data, iterative, non-Mapper-Reducer type => MPI

But then, I also see an implementation of MapReduce on MPI (MR-MPI), which does not provide fault tolerance but seems to be more efficient than MapReduce on Hadoop on some benchmarks, and seems to handle big data using out-of-core memory.

Conversely, there are also MPI implementations (MPICH2-YARN) that run on the new-generation Hadoop YARN with its distributed file system (HDFS).

Besides, there seem to be provisions within MPI (Scatter-Gather, Checkpoint-Restart, ULFM and other fault-tolerance mechanisms) that mimic several features of the MapReduce paradigm.
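To make the Scatter-Gather comparison concrete, here is a minimal sketch in plain Python that imitates the pattern: split the input across "ranks", let each rank count words locally, then gather and merge the partial results. It is only a single-process simulation and does not call any MPI library; the function names (scatter, local_map, gather_and_reduce) and the sample lines are made up for illustration.

```python
from collections import Counter
from functools import reduce


def scatter(items, n_ranks):
    """Round-robin split of items into n_ranks chunks (stand-in for a scatter step)."""
    return [items[r::n_ranks] for r in range(n_ranks)]


def local_map(chunk):
    """Work done independently on each rank: count the words in its own chunk."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts


def gather_and_reduce(partials):
    """Stand-in for gather + reduce: merge every rank's partial counts into one."""
    return reduce(lambda a, b: a + b, partials, Counter())


# Hypothetical input, chosen only for illustration.
lines = ["map reduce", "mpi scatter gather", "map mpi"]
partials = [local_map(chunk) for chunk in scatter(lines, 2)]
totals = gather_and_reduce(partials)
print(totals)
```

Note what the sketch leaves out: there is no shuffle by key, no spilling to disk, and no re-execution of failed tasks, which is where Hadoop and plain MPI collectives differ most.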

And how do Mahout, Mesos and Spark fit into all this?

What criteria can be used when deciding between (or a combination of) Hadoop MapReduce, MPI, Mesos, Spark and Mahout?

Answer 1

Aar*_*man 10

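There might be good technical criteria for this decision, but I haven't seen anything published on it. There seems to be a cultural divide: MapReduce is understood to be used for sifting through data in corporate environments, while scientific workloads use MPI. That may be due to the underlying sensitivity of those workloads to network performance. Here are a few thoughts on how to find out: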

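Many modern MPI implementations can run over multiple networks but are heavily optimized for Infiniband. The canonical use case for MapReduce seems to be a cluster of "white box" commodity systems connected via Ethernet. A quick search for "MapReduce Infiniband" leads to http://dl.acm.org/citation.cfm?id=2511027, which suggests that using Infiniband in a MapReduce environment is a relatively new thing.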

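So why would you want to run on a system that is highly optimized for Infiniband? It is significantly more expensive than Ethernet, but it has higher bandwidth, lower latency, and scales better under high network contention (ref: http://www.hpcadvisorycouncil.com/pdf/IB_and_10GigE_in_HPC.pdf).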

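If your application is sensitive to the effects of the Infiniband optimizations that are already baked into many MPI libraries, then MPI may be useful for you. If your application is relatively insensitive to network performance and spends most of its time on computation that needs no communication between processes, MapReduce may be the better choice.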

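If you have the opportunity to run benchmarks, you could do a projection on whichever system you have available to see how much improved network performance would help. Try throttling your network: for example, downclock GigE to 100 Mbit/s or Infiniband QDR to DDR, draw a line through the results, and see whether buying a faster interconnect that MPI is optimized for would get you where you want to go.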
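One rough way to "draw a line through the results" is sketched below. It assumes, purely as a modelling choice that is not part of the answer above, that runtime is a fixed compute term plus a communication term inversely proportional to bandwidth (runtime = compute + volume / bandwidth), and fits those two numbers by least squares. The function names and the two measurements are hypothetical placeholders, not real benchmark data.

```python
def fit_runtime_model(bandwidths_mbit, runtimes_s):
    """Least-squares fit of runtime = compute + volume / bandwidth.

    Returns (compute, volume): the bandwidth-independent seconds and the
    coefficient of the 1/bandwidth term.
    """
    xs = [1.0 / b for b in bandwidths_mbit]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(runtimes_s) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, runtimes_s))
    volume = sxy / sxx
    compute = mean_y - volume * mean_x
    return compute, volume


def predict_runtime(compute, volume, bandwidth_mbit):
    """Projected runtime at a bandwidth that was not measured."""
    return compute + volume / bandwidth_mbit


# Hypothetical measurements: the same job on a throttled and an unthrottled link.
bandwidths = [100.0, 1000.0]   # Mbit/s
runtimes = [520.0, 160.0]      # seconds
compute, volume = fit_runtime_model(bandwidths, runtimes)
projected = predict_runtime(compute, volume, 10000.0)
print(f"compute-bound part: {compute:.1f} s, projected at 10 Gbit/s: {projected:.1f} s")
```

If the fitted compute term dominates, a faster interconnect buys little. The model ignores latency and contention, so more than two measurement points are worth collecting before trusting the extrapolation.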

Answer 2

小智 7

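The link you posted about FEM being done on MapReduce (http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6188175&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D6188175) uses MPI. It says so right in the abstract. They combined MPI's programming model (non-embarrassingly parallel) with HDFS to "stage" the data and exploit data locality.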

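Hadoop is purely for embarrassingly parallel computations. Anything that requires processes to organize themselves and exchange data in complex ways will get terrible performance with Hadoop. This can be demonstrated both from an algorithmic-complexity point of view and from a measurement point of view.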
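As an illustration of what "exchanging data" means for FEM-like workloads, the sketch below runs a simple 1D smoothing (Jacobi-style) iteration twice: once on the whole array, and once split into chunks that must swap their edge values with their neighbours before every step. It is a single-process simulation in plain Python, not MPI or Hadoop code, and all names and inputs are hypothetical.

```python
def jacobi_step(values, left, right):
    """One smoothing step; left/right are the boundary values just outside the chunk."""
    padded = [left] + values + [right]
    return [(padded[i - 1] + padded[i + 1]) / 2.0 for i in range(1, len(padded) - 1)]


def run_serial(values, steps, left=0.0, right=0.0):
    """Reference result: iterate on the whole array at once."""
    for _ in range(steps):
        values = jacobi_step(values, left, right)
    return values


def run_partitioned(values, steps, parts, left=0.0, right=0.0):
    """Same iteration on `parts` chunks; returns (result, number of edge messages)."""
    size = len(values) // parts  # assumes len(values) is divisible by parts
    chunks = [values[i * size:(i + 1) * size] for i in range(parts)]
    messages = 0
    for _ in range(steps):
        # Halo exchange: each chunk needs its neighbours' current edge values.
        halos = []
        for p in range(parts):
            lo = chunks[p - 1][-1] if p > 0 else left
            hi = chunks[p + 1][0] if p < parts - 1 else right
            messages += (p > 0) + (p < parts - 1)
            halos.append((lo, hi))
        chunks = [jacobi_step(c, lo, hi) for c, (lo, hi) in zip(chunks, halos)]
    return [v for c in chunks for v in c], messages


data = [float(i % 5) for i in range(16)]
serial = run_serial(data, 10)
partitioned, messages = run_partitioned(data, 10, 4)
print(partitioned == serial, messages)
```

The chunks cannot proceed independently: every iteration requires a round of small neighbour-to-neighbour messages. In MPI these are cheap point-to-point sends, whereas in classic Hadoop MapReduce each iteration would typically be a separate job with its own startup and HDFS round trip.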
