I have a dataframe as shown below. For a given id, all values are identical except for the mappingcol field.
+--------------------+----------------+--------------------+-------+
|misc |fruit |mappingcol |id |
+--------------------+----------------+--------------------+-------+
|ddd |apple |Map("name"->"Sameer"| 1 |
|ref |banana |Map("name"->"Riyazi"| 2 |
|ref |banana |Map("lname"->"Nikki"| 2 |
|ddd |apple |Map("lname"->"tenka"| 1 |
+--------------------+----------------+--------------------+-------+
I want to merge the rows that share the same id so that I end up with exactly one row per id, with the mappingcol values merged. The output should look like this:
+--------------------+----------------+--------------------+-------+
|misc |fruit |mappingcol |id |
+--------------------+----------------+--------------------+-------+
|ddd |apple |Map("name"->"Sameer"| 1 |
|ref |banana |Map("name"->"Riyazi"| 2 |
+--------------------+----------------+--------------------+-------+
For id = 1, the value of mappingcol would be:
Map(
"name" -> "Sameer",
"lname" -> "tenka"
)
I know maps can be merged with the ++ operator, so that is not what I'm worried about. I just can't figure out how to merge the rows themselves, because with a groupBy I don't have anything to aggregate the rows with.
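For reference, this is what ++ does on plain Scala maps (entries from the right-hand map win on duplicate keys):

val a = Map("name" -> "Sameer")
val b = Map("lname" -> "tenka")
a ++ b  // Map(name -> Sameer, lname -> tenka)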
You can use groupBy and then massage the maps a little:
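Assuming the sample df from the question is built along these lines (a sketch; the SparkSession setup and app name are assumptions, the column names come from the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("merge-maps").getOrCreate()
import spark.implicits._

val df = Seq(
  ("ddd", "apple",  Map("name"  -> "Sameer"), 1),
  ("ref", "banana", Map("name"  -> "Riyazi"), 2),
  ("ref", "banana", Map("lname" -> "Nikki"),  2),
  ("ddd", "apple",  Map("lname" -> "tenka"),  1)
).toDF("misc", "fruit", "mappingcol", "id")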
import org.apache.spark.sql.functions.collect_list
import spark.implicits._  // needed for .as[...] and .map on Datasets

df.groupBy("id", "fruit", "misc")
  .agg(collect_list("mappingcol"))                       // gather all maps per group into an array
  .as[(Int, String, String, Seq[Map[String, String]])]   // switch to the typed API
  .map { case (id, fruit, misc, maps) => (id, fruit, misc, maps.reduce(_ ++ _)) } // fold with ++
  .toDF("id", "fruit", "misc", "mappingColumn")
Output:
+---+------+----+--------------------------------+
|id |fruit |misc|mappingColumn |
+---+------+----+--------------------------------+
|1 |apple |ddd |[name -> Sameer, lname -> tenka]|
|2 |banana|ref |[name -> Riyazi, lname -> Nikki]|
+---+------+----+--------------------------------+
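As a side note, on Spark 2.4+ the merge can also stay in the untyped API via the aggregate and map_concat SQL functions. A sketch under the assumption that no key appears in more than one map per id (recent Spark versions make map_concat reject duplicate keys by default):

import org.apache.spark.sql.functions.{collect_list, expr}

df.groupBy("id", "fruit", "misc")
  .agg(collect_list("mappingcol").as("maps"))  // array of maps per group
  .withColumn("mappingColumn",
    // fold the array, starting from an empty map
    expr("aggregate(maps, map(), (acc, m) -> map_concat(acc, m))"))
  .drop("maps")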