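I need to merge rows within the same dataframe based on the key column "id". In the sample dataframe, one row holds id, name, and age, and another row holds id, name, and city. Rows that share the same "id" must be combined into a single record in the final dataframe. If an id has only one record, it should still appear, with nulls in the missing column (Smith and Jake), as in the example below.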
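The computation has to run on real-time data, so a solution based on Spark native functions would be ideal. I tried filtering the records on the age and city columns into separate dataframes and then doing a left join on id, but that is not very efficient. Any alternative suggestions are welcome. Thanks in advance!
Sample dataframe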
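// Assumes an active SparkSession named `spark` (as in spark-shell).
import spark.implicits._

// Each id has up to two rows: one carrying age, the other carrying city.
val inputDF = Seq[(String, String, Option[Int], Option[String])](
  ("100", "John", Some(35), None),
  ("100", "John", None, Some("Georgia")),
  ("101", "Mike", Some(25), None),
  ("101", "Mike", None, Some("New York")),
  ("103", "Mary", Some(22), None),
  ("103", "Mary", None, Some("Texas")),
  ("104", "Smith", Some(25), None),
  ("105", "Jake", None, Some("Florida"))
).toDF("id", "name", "age", "city")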
Input dataframe
+---+-----+----+--------+
|id |name |age |city |
+---+-----+----+--------+
|100|John |35 |null |
|100|John |null|Georgia |
|101|Mike |25 |null |
|101|Mike |null|New York|
|103|Mary |22 |null |
|103|Mary |null|Texas |
|104|Smith|25 |null |
|105|Jake |null|Florida |
+---+-----+----+--------+
Expected output dataframe
+---+-----+----+---------+
| id| name| age| city|
+---+-----+----+---------+
|100| …
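One possible alternative to the filter-and-join attempt described above is a single groupBy with the built-in `first` aggregate and `ignoreNulls = true`, which picks the first non-null value per column within each group. The sketch below is only an illustration under some assumptions: the object name `MergeRowsById` and the helper `mergeById` are hypothetical, name is assumed to be identical across all rows of the same id, and each id is assumed to have at most one non-null age and one non-null city. It uses a local batch session; how it behaves on a streaming source is not covered here.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, first}

object MergeRowsById {

  // Hypothetical helper: collapses all rows of an id into one record by
  // taking the first non-null age and the first non-null city per group.
  // Assumes name does not vary within an id, so it can be a grouping key.
  def mergeById(df: DataFrame): DataFrame =
    df.groupBy("id", "name")
      .agg(
        first(col("age"), ignoreNulls = true).as("age"),
        first(col("city"), ignoreNulls = true).as("city")
      )

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; spark-shell already provides `spark`.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("merge-rows-by-id")
      .getOrCreate()
    import spark.implicits._

    val inputDF = Seq[(String, String, Option[Int], Option[String])](
      ("100", "John", Some(35), None),
      ("100", "John", None, Some("Georgia")),
      ("101", "Mike", Some(25), None),
      ("101", "Mike", None, Some("New York")),
      ("103", "Mary", Some(22), None),
      ("103", "Mary", None, Some("Texas")),
      ("104", "Smith", Some(25), None),
      ("105", "Jake", None, Some("Florida"))
    ).toDF("id", "name", "age", "city")

    // Ids with a single row (104, 105) keep null in the column they lack.
    mergeById(inputDF).orderBy("id").show(false)

    spark.stop()
  }
}
```

This avoids splitting the dataframe and joining it back on itself, at the cost of one shuffle on the grouping keys. If an id could carry several different non-null values in the same column, `first` would pick one of them in no guaranteed order, so an aggregate such as `max` or `collect_list` might be a safer choice in that case.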