I have a DataFrame with the following schema:
root
|-- id_1: long (nullable = true)
|-- id_2: long (nullable = true)
|-- score: double (nullable = true)
The data looks like this:
+----+----+------------------+
|id_1|id_2|score             |
+----+----+------------------+
|0   |9   |0.5888888888888889|
|0   |1   |0.6166666666666667|
|0   |2   |0.496996996996997 |
|1   |9   |0.6222222222222221|
|1   |6   |0.9082996632996633|
|1   |5   |0.5927450980392157|
|2   |3   |0.665774107440774 |
|3   |8   |0.6872367465504721|
|3   |8   |0.6872367465504721|
|5   |6   |0.5365909090909091|
+----+----+------------------+
The goal is to find, for each id_1, the id_2 with the maximum score. Maybe I'm wrong, but... it seems I just need to create a pair RDD:
root
|-- _1: long (nullable = true)
|-- _2: struct (nullable = true)
| …
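To make the goal concrete, here is a minimal sketch of the "best id_2 per id_1" logic in plain Python rather than Spark. It only illustrates the reduction itself: the rows are copied from the sample data above, and the helper name `best_id2_per_id1` is hypothetical. On a pair RDD keyed by id_1, the same comparison could presumably be passed to `reduceByKey`, but that is an assumption, not something tested here.

```python
# Rows copied from the sample table above: (id_1, id_2, score).
rows = [
    (0, 9, 0.5888888888888889),
    (0, 1, 0.6166666666666667),
    (0, 2, 0.496996996996997),
    (1, 9, 0.6222222222222221),
    (1, 6, 0.9082996632996633),
    (1, 5, 0.5927450980392157),
    (2, 3, 0.665774107440774),
    (3, 8, 0.6872367465504721),
    (3, 8, 0.6872367465504721),
    (5, 6, 0.5365909090909091),
]


def best_id2_per_id1(rows):
    """Return {id_1: (id_2, score)}, keeping the highest score per id_1.

    Hypothetical helper for illustration; not part of any Spark API.
    """
    best = {}
    for id_1, id_2, score in rows:
        # Strict comparison: on a tie, the first row seen is kept.
        if id_1 not in best or score > best[id_1][1]:
            best[id_1] = (id_2, score)
    return best


result = best_id2_per_id1(rows)
for id_1 in sorted(result):
    print(id_1, result[id_1])
```

Note that id_1 = 3 appears twice with identical values, so whichever tie-breaking rule is chosen gives the same answer for this data.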