Applying groupBy and orderBy on a dataframe in Scala

Ant*_*ony 0 scala dataframe apache-spark apache-spark-sql

I need to sort on one column's values and group by another column in a dataframe.

The data in the dataframe looks like this:

+------------+---------+-----+
|      NUM_ID|    TIME |SIG_V|
+------------+---------+-----+
|XXXXX01     |167499000|55   |
|XXXXX02     |167499000|     |
|XXXXX01     |167503000|     |
|XXXXX02     |179810000| 81.0|
|XXXXX02     |179811000| 81.0|
|XXXXX01     |179833000|     |
|XXXXX02     |179833000|     |
|XXXXX02     |179841000| 81.0|
|XXXXX01     |179841000|     |
|XXXXX02     |179842000| 81.0|
|XXXXX03     |179843000| 87.0|
|XXXXX02     |179849000|     |
|XXXXX02     |179850000|     |
|XXXXX01     |179850000| 88.0|
|XXXXX01     |179857000|     |
|XXXXX01     |179858000|     |
|XXXXX01     |179865000|     |
|XXXXX03     |179865000|     |
|XXXXX02     |179870000|     |
|XXXXX02     |179871000| 11  |
+------------+---------+-----+

The data above is sorted by the TIME column.

My requirement is to group the rows by the NUM_ID column, as shown below.

+------------+---------+-----+
|      NUM_ID|    TIME |SIG_V|
+------------+---------+-----+
|XXXXX01     |167499000|55   |
|XXXXX01     |167503000|     |
|XXXXX01     |179833000|     |
|XXXXX01     |179841000|     |
|XXXXX01     |179850000| 88.0|
|XXXXX01     |179857000|     |
|XXXXX01     |179858000|     |
|XXXXX01     |179865000|     |
|XXXXX02     |167499000|     |
|XXXXX02     |179810000| 81.0|
|XXXXX02     |179811000| 81.0|
|XXXXX02     |179833000|     |
|XXXXX02     |179841000| 81.0|
|XXXXX02     |179842000| 81.0|
|XXXXX02     |179849000|     |
|XXXXX02     |179850000|     |
|XXXXX02     |179870000|     |
|XXXXX02     |179871000| 11  |
|XXXXX03     |179843000| 87.0|
|XXXXX03     |179865000|     |
+------------+---------+-----+

Here the NUM_ID column is grouped, and TIME is sorted within each NUM_ID.

I tried applying groupBy and orderBy to the dataframe, which did not work:

val df2 =  df1.withColumn("SIG_V", col("SIG")).orderBy("TIME").groupBy("NUM_ID")

and got an error on df2.show:

error: value orderBy is not a member of org.apache.spark.sql.RelationalGroupedDataset

Any clue how to achieve this requirement?

chl*_*bek 5

You don't need groupBy here. groupBy returns a RelationalGroupedDataset, which is meant for aggregations (hence the error: it has no orderBy method). To get this layout, just put both columns in orderBy:

scala> df.show()
+---+---+
| _1| _2|
+---+---+
|  1|  3|
|  2|  2|
|  1|  4|
|  1|  1|
|  2|  0|
|  1| 10|
|  2|  5|
+---+---+


scala> df.orderBy('_1,'_2).show()
+---+---+
| _1| _2|
+---+---+
|  1|  1|
|  1|  3|
|  1|  4|
|  1| 10|
|  2|  0|
|  2|  2|
|  2|  5|
+---+---+
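Applied to the question's dataframe, the same fix might look like this (a sketch assuming `df1` and an original column named `SIG`, as in the question's own snippet):

```scala
import org.apache.spark.sql.functions.col

// Rename SIG to SIG_V as in the question, then sort by both columns:
// NUM_ID first (groups equal IDs together), TIME second (sorts within each ID).
val df2 = df1
  .withColumn("SIG_V", col("SIG"))
  .orderBy("NUM_ID", "TIME")

df2.show()
```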