rdd.sortByKey gives the wrong result

Leo*_*eon 2 scala apache-spark rdd

I copied the body of sortByKey and renamed it sortByKey2, but the two give different results. Why is the first result wrong here? This is running in Eclipse; I restarted Eclipse but still get the wrong result.

package test.spark

import org.apache.spark.sql.SparkSession

object RddTests {
  val spark = SparkSession.builder().appName("rdd-test").master("local[*]")
    .enableHiveSupport()
    .getOrCreate()

  val sc = spark.sparkContext

  def main(args: Array[String]) {
    //mapValues
    //combineWithKey
    //foldByKey
    sortByKey
    sortByKey2
  }    

  def sortByKey() {
    val people = List(("Mobin", 2), ("Mobin", 1), ("Lucy", 2), ("Amy", 1), ("Lucy", 3), ("Lucy", 1))
    val rdd = sc.parallelize(people)
    val sortByKeyRDD = rdd.sortByKey()
    println;println("sortByKeyRDD")
    sortByKeyRDD.foreach(println)
  }

  def sortByKey2() {
    val people = List(("Mobin", 2), ("Mobin", 1), ("Lucy", 2), ("Amy", 1), ("Lucy", 3), ("Lucy", 1))
    val rdd = sc.parallelize(people)
    val sortByKeyRDD = rdd.sortByKey()
    println;println("sortByKeyRDD2")
    sortByKeyRDD.foreach(println)
  }
}

The output is:

[Stage 0:>                                                          (0 + 0) / 4]

sortByKeyRDD
(Mobin,2)
(Mobin,1)
(Amy,1)
(Lucy,2)
(Lucy,3)
(Lucy,1)

sortByKeyRDD2
(Amy,1)
(Mobin,2)
(Mobin,1)
(Lucy,2)
(Lucy,3)
(Lucy,1)

Joe*_*las 5

foreach does not guarantee that elements will be processed in any particular order. If instead you call sortByKeyRDD.collect.foreach(println), you will see the results in order, although this assumes your data fits in driver memory.
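So in the question's code, the fix is sortByKeyRDD.collect.foreach(println). As a plain-Scala sketch of the order collect should then print (Spark guarantees ascending key order; the relative order of duplicate keys is not guaranteed), sorting the question's list locally gives:

```scala
// The question's data, sorted locally with plain Scala.
// sortBy orders by key ascending, so "Amy" comes first, then the
// "Lucy" entries, then the "Mobin" entries -- the key order that
// rdd.sortByKey().collect returns.
val people = List(("Mobin", 2), ("Mobin", 1), ("Lucy", 2), ("Amy", 1), ("Lucy", 3), ("Lucy", 1))
val sorted = people.sortBy(_._1)
sorted.foreach(println)
```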

As stated in the sortByKey documentation:

Calling collect or save on the resulting RDD will return or output an ordered list of records.

[Edit] Using toLocalIterator instead of collect limits the driver's memory requirement to the size of the largest single partition. Thanks to user8371915 for pointing this out in the comments.
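In Spark the change is simply sortByKeyRDD.toLocalIterator.foreach(println). A plain-Scala sketch of why this bounds driver memory (the partition contents below are hypothetical, chosen to mirror the question's data):

```scala
// Plain-Scala analogy for RDD.toLocalIterator (a sketch, not Spark):
// the sorted data lives in several "partitions", and a lazy iterator
// walks them one partition at a time, so the consumer never needs to
// hold more than one partition's records in memory at once.
val partitions = List(
  List(("Amy", 1), ("Lucy", 2)),     // partition 0
  List(("Lucy", 3), ("Lucy", 1)),    // partition 1
  List(("Mobin", 2), ("Mobin", 1))   // partition 2
)
val localIter: Iterator[(String, Int)] = partitions.iterator.flatMap(_.iterator)
localIter.foreach(println)  // records arrive in sorted-key order
```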