Leo*_*eon · 2 · scala apache-spark rdd
I copied the body of sortByKey into a new method named sortByKey2, but the two give different results. Why is the first result wrong here? This is running in Eclipse; I restarted Eclipse and still get the wrong result.
package test.spark

import org.apache.spark.sql.SparkSession

object RddTests {
  val spark = SparkSession.builder().appName("rdd-test").master("local[*]")
    .enableHiveSupport()
    .getOrCreate()
  val sc = spark.sparkContext

  def main(args: Array[String]) {
    //mapValues
    //combineWithKey
    //foldByKey
    sortByKey
    sortByKey2
  }

  def sortByKey() {
    val people = List(("Mobin", 2), ("Mobin", 1), ("Lucy", 2), ("Amy", 1), ("Lucy", 3), ("Lucy", 1))
    val rdd = sc.parallelize(people)
    val sortByKeyRDD = rdd.sortByKey()
    println()
    println("sortByKeyRDD")
    sortByKeyRDD.foreach(println)
  }

  def sortByKey2() {
    val people = List(("Mobin", 2), ("Mobin", 1), ("Lucy", 2), ("Amy", 1), ("Lucy", 3), ("Lucy", 1))
    val rdd = sc.parallelize(people)
    val sortByKeyRDD = rdd.sortByKey()
    println()
    println("sortByKeyRDD2")
    sortByKeyRDD.foreach(println)
  }
}
The output is:
[Stage 0:> (0 + 0) / 4]
sortByKeyRDD
(Mobin,2)
(Mobin,1)
(Amy,1)
(Lucy,2)
(Lucy,3)
(Lucy,1)
sortByKeyRDD2
(Amy,1)
(Mobin,2)
(Mobin,1)
(Lucy,2)
(Lucy,3)
(Lucy,1)
foreach makes no guarantee that elements will be processed in any particular order; each partition's records are printed concurrently by executor threads. If you use sortByKeyRDD.collect.foreach(println) instead, you will see the results in order, but that assumes your data fits in driver memory.
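As a minimal sketch of the difference (reusing the people list from the question):

    val sortByKeyRDD = sc.parallelize(people).sortByKey()

    // Prints from executor threads: order across partitions is arbitrary
    sortByKeyRDD.foreach(println)

    // collect() brings the sorted records back to the driver as an Array,
    // so iterating over that array prints them in key order
    sortByKeyRDD.collect().foreach(println)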
As the sortByKey documentation notes:

Calling collect or save on the resulting RDD will return or output an ordered list of records
[Edit] Using toLocalIterator instead of collect limits the driver memory requirement to the size of the largest single partition. Thanks to user8371915 for pointing this out in the comments.
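A sketch of that variant, using the same sortByKeyRDD as above:

    // toLocalIterator streams one partition at a time to the driver, so peak
    // driver memory is bounded by the largest partition, not the whole RDD,
    // and the records still arrive in sorted order
    sortByKeyRDD.toLocalIterator.foreach(println)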