Spark DataFrame: remove the row with the MAX value in each group

use*_*925 6 dataframe apache-spark apache-spark-sql

My data looks like this:

id | val
---------------- 
a1 |  10
a1 |  20
a2 |  5
a2 |  7
a2 |  2

If I group by "id", I want to remove the row with MAX(val) from each group.

The result should be:

id | val
---------------- 
a1 |  10
a2 |  5
a2 |  2

I am using a Spark DataFrame with SQLContext. I need something like:

DataFrame df = sqlContext.sql("SELECT * FROM jsontable WHERE (id, val) NOT IN (SELECT id, MAX(val) FROM jsontable GROUP BY id)");

How can I do this?
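The intended "drop the per-group max" logic can be sketched on plain Scala collections first, independent of Spark (the data below is the example from the question; variable names are illustrative):

```scala
// Sketch: remove the row carrying the MAX value of each id group.
val rows = Seq(("a1", 10), ("a1", 20), ("a2", 5), ("a2", 7), ("a2", 2))

// Compute the max value per id
val maxPerId: Map[String, Int] =
  rows.groupBy(_._1).map { case (id, vs) => (id, vs.map(_._2).max) }

// Keep only rows whose value is not the group's max
val result = rows.filter { case (id, v) => v != maxPerId(id) }
// result: Seq((a1,10), (a2,5), (a2,2))
```

Note that if the max value occurs more than once in a group, this filter drops every occurrence, which is also how the NOT IN query above would behave.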

mar*_*ios 0

Here is how to do it with RDDs, in a more Scala-flavored style:

// Let's first get the data in key-value pair format
val data = sc.makeRDD( Seq( ("a",20), ("a", 1), ("a",8), ("b",3), ("b",10), ("b",9) ) )

// Next let's find the max value from each group
val maxGroups = data.reduceByKey( Math.max(_,_) )

// We join the max in the group with the original data
val combineMaxWithData = maxGroups.join(data)

// Finally we filter out the values that agree with the max
val finalResults = combineMaxWithData
  .filter { case (gid, (max, curVal)) => max != curVal }
  .map    { case (gid, (max, curVal)) => (gid, curVal) }


println( finalResults.collect.toList )
>List((a,1), (a,8), (b,3), (b,9))
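The same reduceByKey → join → filter pipeline can be traced on plain Scala collections (no Spark cluster needed), which makes it easy to check each intermediate step against the sample data:

```scala
// Trace of the RDD pipeline above on plain collections, same sample data.
val data = Seq(("a", 20), ("a", 1), ("a", 8), ("b", 3), ("b", 10), ("b", 9))

// Equivalent of reduceByKey(Math.max(_, _)): max value per key
val maxGroups: Map[String, Int] =
  data.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).max) }

// Equivalent of maxGroups.join(data): pair each row with its group's max
val joined = data.map { case (k, v) => (k, (maxGroups(k), v)) }

// Equivalent of the filter + map: drop rows equal to the max, keep (key, value)
val finalResults = joined.collect { case (k, (max, v)) if max != v => (k, v) }
// finalResults: Seq((a,1), (a,8), (b,3), (b,9))
```

One caveat when translating back to RDDs: `collect` on an RDD pulls data to the driver, so there the filter-then-map form shown in the answer is the right shape; the collection `collect` here is just Scala's partial-function variant.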