How do I sort data within each partition in Spark?

Question

I have some data in which each line holds two tab-separated fields.

When I repartition the data without sorting, the code is:

val sc = new SparkContext
val file = sc.textFile(args(0))
  .map { a =>
    val splits = a.split("\t")
    (new MyObject(splits(0), splits(1).toInt), "")
  }
  .partitionBy(new MyPartitioner(3)) // no sortByKey() here, so nothing is sorted

The result is:

//file:part-00000
(a  2,)
(a  1,)
(a  3,)

//file:part-00001
(b  2,)
(b  3,)
(b  1,)

//file:part-00002
(c  2,)
(c  3,)
(c  1,)

When I repartition the data and then sort it, the code is:

val sc = new SparkContext
val file = sc.textFile(args(0))
  .map { a =>
    val splits = a.split("\t")
    (new MyObject(splits(0), splits(1).toInt), "")
  }
  .partitionBy(new MyPartitioner(3))
  .sortByKey()

The result is below. This is not what I want: the sort disturbs the original partitioning.

//file:part-00000
(a  1,)
(a  2,)
(a  3,)
(b  1,)

//file:part-00001
(b  2,)
(b  3,)
(c  1,)

//file:part-00002
(c  2,)
(c  3,)

The result I expect is:

//file:part-00000
(a  1,)
(a  2,)
(a  3,)

//file:part-00001
(b  1,)
(b  2,)
(b  3,)

//file:part-00002
(c  1,)
(c  2,)
(c  3,)

Can you help me? Thank you very much!

Answer 1

sec*_*ree 6

The sortWithinPartitions function on Datasets also works.

http://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.Dataset

So you can write something along these lines:

df.repartition(col("A"), col("B")).sortWithinPartitions(desc("C")) ...
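
As a fuller illustration of the one-liner above, here is a minimal sketch that can be run locally. The column names `name` and `num`, the sample rows, the helper `sortByNameAndNum`, and the `local[2]` master are all assumptions made for the example, not part of the original answer. Note that `repartition` on a column uses hash partitioning, so two different names can land in the same partition; sorting by `name` first keeps each group contiguous in that case.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, spark_partition_id}

object DatasetSortWithinPartitions {
  // Hypothetical helper: hash-partition by "name", then sort each
  // partition locally. sortWithinPartitions does not move rows
  // between partitions, so the partitioning is preserved.
  def sortByNameAndNum(df: DataFrame, numPartitions: Int): DataFrame =
    df.repartition(numPartitions, col("name"))
      .sortWithinPartitions(col("name"), col("num"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-sort-within-partitions")
      .master("local[2]") // assumption: local run for demonstration only
      .getOrCreate()
    import spark.implicits._

    try {
      // Sample rows mirroring the (letter, number) pairs in the question.
      val df = Seq(
        ("a", 2), ("a", 1), ("a", 3),
        ("b", 2), ("b", 3), ("b", 1),
        ("c", 2), ("c", 3), ("c", 1)
      ).toDF("name", "num")

      // Show which partition each row ended up in.
      sortByNameAndNum(df, 3)
        .withColumn("partition", spark_partition_id())
        .show()
    } finally {
      spark.stop()
    }
  }
}
```

Because the placement of names depends on their hash, this approach does not guarantee one name per partition; a custom partitioner is only available on the RDD API.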

Answer 2

小智 5

You can use repartitionAndSortWithinPartitions:

http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.OrderedRDDFunctions
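
Applied to the question's code, this could look roughly like the sketch below. The question does not show `MyObject` or `MyPartitioner`, so the versions here are hypothetical stand-ins: a case class keyed by a name and a number, and a partitioner that routes on the name alone. The inline sample lines and the `local[2]` master are likewise assumptions. `repartitionAndSortWithinPartitions` needs an implicit `Ordering` for the key type, which is supplied in the companion object.

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// Hypothetical stand-in for the question's MyObject.
case class MyObject(name: String, num: Int)

object MyObject {
  // Keys are ordered by name first, then by number.
  implicit val ordering: Ordering[MyObject] =
    Ordering.by((o: MyObject) => (o.name, o.num))
}

// Hypothetical stand-in for the question's MyPartitioner:
// every key with the same name goes to the same partition.
class MyPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    case MyObject(name, _) => Math.floorMod(name.hashCode, numPartitions)
  }
}

object SortWithinPartitionsExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("sort-within-partitions")
      .setMaster("local[2]") // assumption: local run for demonstration only
    val sc = new SparkContext(conf)

    try {
      // Inline sample standing in for sc.textFile(args(0)).
      val lines = Seq("a\t2", "a\t1", "a\t3", "b\t2", "b\t3", "b\t1",
                      "c\t2", "c\t3", "c\t1")

      val sorted = sc.parallelize(lines)
        .map { line =>
          val splits = line.split("\t")
          (MyObject(splits(0), splits(1).toInt), "")
        }
        // One shuffle: partition with MyPartitioner and sort each
        // partition by key, without re-partitioning afterwards.
        .repartitionAndSortWithinPartitions(new MyPartitioner(3))

      // Print the keys of each partition in order.
      sorted.glom().collect().zipWithIndex.foreach { case (part, i) =>
        println(s"partition $i: ${part.map(_._1).mkString(", ")}")
      }
    } finally {
      sc.stop()
    }
  }
}
```

Unlike `partitionBy(...).sortByKey()`, which range-partitions the data again in order to sort it, this keeps the layout chosen by the custom partitioner and only orders the records inside each partition.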

Archived:	9 years, 9 months ago
Views:	7304
Last recorded:	7 years, 3 months ago