Does flatMap in Spark cause a shuffle?

Question

pyt*_*nic 7 scala bigdata apache-spark

Does flatMap in Spark behave like the map function and therefore cause no shuffle, or does it trigger a shuffle? I suspect it does cause shuffling. Can someone confirm?

Answer 1

小智 7

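Neither map nor flatMap causes a shuffle. The operations that can cause a shuffle are:

Repartition operations:
- repartition
- coalesce (only when called with shuffle = true)
ByKey operations (except for counting):
- groupByKey
- reduceByKey
Join operations:
- cogroup
- join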
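
One way to check this yourself is to inspect an RDD's lineage. The following is a minimal sketch, assuming Spark is on the classpath and a local master is acceptable; the object name ShuffleCheck, the helper hasShuffleDependency, and the sample data are made up for illustration.

```scala
import org.apache.spark.{ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper object, not part of Spark.
object ShuffleCheck {
  // True if the RDD reads at least one of its direct parents through a shuffle.
  def hasShuffleDependency(rdd: RDD[_]): Boolean =
    rdd.dependencies.exists(_.isInstanceOf[ShuffleDependency[_, _, _]])

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("shuffle-check")
    val sc = new SparkContext(conf)
    try {
      val lines  = sc.parallelize(Seq("a b", "b c", "c a"), numSlices = 2)
      val words  = lines.flatMap(_.split(" "))  // narrow dependency
      val pairs  = words.map(w => (w, 1))       // narrow dependency
      val counts = pairs.reduceByKey(_ + _)     // wide dependency

      println(s"flatMap shuffles:     ${hasShuffleDependency(words)}")
      println(s"map shuffles:         ${hasShuffleDependency(pairs)}")
      println(s"reduceByKey shuffles: ${hasShuffleDependency(counts)}")

      // The lineage dump marks the stage boundary introduced by the shuffle.
      println(counts.toDebugString)
    } finally {
      sc.stop()
    }
  }
}
```

If the answer above holds, the first two lines should report false and the third true, and the lineage printed by toDebugString should contain a ShuffledRDD only for the reduceByKey step.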

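Although the set of elements in each partition of newly shuffled data is deterministic, and so is the ordering of the partitions themselves, the ordering of the elements within a partition is not. If you need predictably ordered data after a shuffle, you can use: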

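- mapPartitions to sort each partition using, for example, .sorted
- repartitionAndSortWithinPartitions to efficiently sort partitions while simultaneously repartitioning
- sortBy to make a globally ordered RDD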
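
The three options above can be sketched as follows. This is an illustrative example only, assuming Spark is on the classpath; the object name OrderingAfterShuffle, the sample pairs, and the choice of HashPartitioner(2) are assumptions, not something taken from the answer.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Hypothetical demo object, not part of Spark.
object OrderingAfterShuffle {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("ordering-after-shuffle")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3), ("a", 4)), numSlices = 2)

      // 1. Sort the contents of each existing partition; no data moves between partitions.
      val sortedPerPartition =
        pairs.mapPartitions(iter => iter.toList.sorted.iterator, preservesPartitioning = true)

      // 2. Repartition by key and sort by key inside each new partition in one shuffle.
      val repartitionedAndSorted =
        pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))

      // 3. Produce an RDD that is ordered by key across all partitions.
      val globallySorted = pairs.sortBy(_._1)

      // glom() turns each partition into an array so the per-partition order is visible.
      Seq(sortedPerPartition, repartitionedAndSorted, globallySorted).foreach { rdd =>
        println(rdd.glom().collect().map(_.mkString("[", ", ", "]")).mkString(" "))
      }
    } finally {
      sc.stop()
    }
  }
}
```

Option 1 only reorders elements in place, while options 2 and 3 each involve a shuffle of their own, so which one fits depends on whether per-partition or global order is needed.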

More info: http://spark.apache.org/docs/latest/programming-guide.html#shuffle-operations

Answer 2

Aiv*_*ean 5

No shuffle. Here is the source of both functions:

/**
 * Return a new RDD by applying a function to all elements of this RDD.
 */
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}

/**
 *  Return a new RDD by first applying a function to all elements of this
 *  RDD, and then flattening the results.
 */
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
}


As you can see, RDD.flatMap simply calls flatMap on the Scala iterator that represents the partition.
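
To make the point concrete without Spark, here is a small plain-Scala sketch that models partitions as nested sequences. The names PartitionLocalFlatMap, Partitions, and flatMapPartitions are hypothetical and only imitate the per-partition iterator call shown in the source above.

```scala
// Hypothetical stand-in for an RDD: each inner Seq plays the role of one partition.
object PartitionLocalFlatMap {
  type Partitions[A] = Seq[Seq[A]]

  // Applies flatMap to each partition's iterator independently,
  // so every output element stays in the partition its input came from.
  def flatMapPartitions[A, B](parts: Partitions[A])(f: A => Seq[B]): Partitions[B] =
    parts.map(part => part.iterator.flatMap(f).toList)

  def main(args: Array[String]): Unit = {
    val parts: Partitions[String] = Seq(Seq("a b", "c"), Seq("d e f"))
    val result = flatMapPartitions(parts)(_.split(" ").toSeq)

    // The number of partitions is unchanged; only their contents grow.
    println(result)
  }
}
```

Because each partition is processed on its own, no element ever needs to cross a partition boundary, which is why no shuffle is involved.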
