How to get grouped results as a list of tuples on a Spark DataFrame?

Shi*_*ani 2 scala aggregate user-defined-functions apache-spark-sql

I am trying to group entities by id. Running the code below, I have this dataframe:

val pet_type_count = pet_list.groupBy("id","pets_type").count()
pet_type_count.sort("id").limit(20).show
+----------+---------------------+-----+
|        id|            pets_type|count|
+----------+---------------------+-----+
|         0|                    0|    2|
|         1|                    0|    3|
|         1|                    3|    3|
|        10|                    0|    4|
|        10|                    1|    1|
|        13|                    0|    3|
|        16|                    1|    3|
|        17|                    1|    1|
|        18|                    1|    2|
|        18|                    0|    1|
|        19|                    1|    7|
+----------+---------------------+-----+

I now want to group these grouped results by id and get back a list of tuples for each id, so that I can apply the following udf per id:

val agg_udf = udf { (v1: List[Tuple2[String, String]]) =>
    var feature_vector = Array.fill(5)(0)
    for (row <- v1) {
      val index = (5 - row._1.toInt)
      feature_vector(index) = row._2.toInt
    }
    feature_vector
}

val pet_vector_included = pet_type_count.groupBy("id").agg(agg_udf(col("pets_type_count")).alias("pet_count_vector"))

For this, I need the grouped results to look like a list of (pets_type, count) tuples per id that I can pass to the udf above.

I can't figure out how to get the tuples after the groupBy on id. Any help would be appreciated!

Ram*_*jan 5

You can simply use the struct built-in function to combine the pets_type and count columns into a single column, and then use the collect_list built-in function to collect the newly formed column while grouping by id. You can sort the dataframe by the id column using orderBy.

import org.apache.spark.sql.functions._
val pet_type_count = df.withColumn("struct", struct("pets_type", "count"))
  .groupBy("id").agg(collect_list(col("struct")).as("pets_type_count"))
  .orderBy("id")

This should give you your desired result:

+---+---------------+
|id |pets_type_count|
+---+---------------+
|0  |[[0,2]]        |
|1  |[[0,3], [3,3]] |
|10 |[[0,4], [1,1]] |
|13 |[[0,3]]        |
|16 |[[1,3]]        |
|17 |[[1,1]]        |
|18 |[[1,2], [0,1]] |
|19 |[[1,7]]        |
+---+---------------+

Then you can apply your udf as defined above (it needs a few modifications too) as follows:

import org.apache.spark.sql.Row

val agg_udf =  udf { (v1: Seq[Row]) =>
  var feature_vector = Array.fill(5)(0)
  for (row <- v1) {
    val index = (4 - row.getAs[Int](0))
    feature_vector(index) = row.getAs[Int](1)
  }
  feature_vector
}

val pet_vector_included = pet_type_count.withColumn("pet_count_vector", agg_udf(col("pets_type_count")))

pet_vector_included.show(false)

This should give you:

+---+---------------+----------------+
|id |pets_type_count|pet_count_vector|
+---+---------------+----------------+
|0  |[[0,2]]        |[0, 0, 0, 0, 2] |
|1  |[[0,3], [3,3]] |[0, 3, 0, 0, 3] |
|10 |[[0,4], [1,1]] |[0, 0, 0, 1, 4] |
|13 |[[0,3]]        |[0, 0, 0, 0, 3] |
|16 |[[1,3]]        |[0, 0, 0, 3, 0] |
|17 |[[1,1]]        |[0, 0, 0, 1, 0] |
|18 |[[1,2], [0,1]] |[0, 0, 0, 2, 1] |
|19 |[[1,7]]        |[0, 0, 0, 7, 0] |
+---+---------------+----------------+
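The index arithmetic inside the udf can be sanity-checked in plain Scala, without Spark. This is a minimal sketch; toVector is a hypothetical helper that mirrors the udf body, assuming pets_type values fall in 0..4:

```scala
// Each (pets_type, count) pair fills slot (4 - pets_type) of a 5-slot vector,
// so pets_type 0 lands in the last slot and pets_type 4 in the first.
def toVector(pairs: Seq[(Int, Int)]): Array[Int] = {
  val feature_vector = Array.fill(5)(0)
  for ((petsType, count) <- pairs)
    feature_vector(4 - petsType) = count
  feature_vector
}

// id = 1 groups to [(0,3), (3,3)] -> [0, 3, 0, 0, 3], matching the table above
println(toVector(Seq((0, 3), (3, 3))).mkString("[", ", ", "]"))
```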

I hope the answer is helpful.