Posted by Sir*_*irJ

Best practice for reduceByKey in Spark

I have a DataFrame with the following schema:

root
 |-- id_1: long (nullable = true)
 |-- id_2: long (nullable = true)
 |-- score: double (nullable = true)

The data looks like this:

+----+----+------------------+
|id_1|id_2|score             |
+----+----+------------------+
|0   |9   |0.5888888888888889|
|0   |1   |0.6166666666666667|
|0   |2   |0.496996996996997 |
|1   |9   |0.6222222222222221|
|1   |6   |0.9082996632996633|
|1   |5   |0.5927450980392157|
|2   |3   |0.665774107440774 |
|3   |8   |0.6872367465504721|
|3   |8   |0.6872367465504721|
|5   |6   |0.5365909090909091|
+----+----+------------------+

The goal is to find, for each id_1, the id_2 with the highest score. Maybe I'm wrong, but... it should be enough to create a paired RDD:

root
 |-- _1: long (nullable = true)
 |-- _2: struct (nullable = true)
 | …
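
To make the idea concrete, here is a minimal sketch of the kind of reducer that `reduceByKey` would need, assuming the paired RDD holds `(id_1, (id_2, score))` tuples. The names `max_by_score` and `local_reduce_by_key` are hypothetical, and the helper is only a plain-Python stand-in so the logic can be checked without a cluster; it is not how Spark executes the reduction.

```python
# Sample rows copied from the table above: (id_1, id_2, score)
rows = [
    (0, 9, 0.5888888888888889),
    (0, 1, 0.6166666666666667),
    (0, 2, 0.496996996996997),
    (1, 9, 0.6222222222222221),
    (1, 6, 0.9082996632996633),
    (1, 5, 0.5927450980392157),
    (2, 3, 0.665774107440774),
    (3, 8, 0.6872367465504721),
    (3, 8, 0.6872367465504721),
    (5, 6, 0.5365909090909091),
]


def max_by_score(a, b):
    """Keep the (id_2, score) pair with the higher score.

    Ties on score are broken by the larger id_2 (an arbitrary choice made
    here) so that the function is commutative and associative, which is
    what reduceByKey expects of its reducer.
    """
    return max(a, b, key=lambda pair: (pair[1], pair[0]))


def local_reduce_by_key(pairs, func):
    """Hypothetical local stand-in for RDD.reduceByKey, for testing only."""
    out = {}
    for key, value in pairs:
        out[key] = func(out[key], value) if key in out else value
    return out


# Assumed pair layout: key = id_1, value = (id_2, score)
pairs = [(id_1, (id_2, score)) for id_1, id_2, score in rows]
best = local_reduce_by_key(pairs, max_by_score)

# With PySpark, and assuming `df` is the DataFrame above, the same reducer
# could be passed to the real API:
#   paired = df.rdd.map(lambda r: (r.id_1, (r.id_2, r.score)))
#   best_rdd = paired.reduceByKey(max_by_score)
```

Because the reducer only ever compares two values for the same key, Spark can combine partial results on each partition before shuffling, which is the usual argument for preferring `reduceByKey` over `groupByKey` for this kind of per-key maximum.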

python bigdata apache-spark apache-spark-sql pyspark
