Scala - GroupBy and Max on pair RDD

Aja*_*jay 0 scala apache-spark

I'm new to Spark Scala and want to find the maximum salary for each department.

Dept,Salary
Dept1,1000
Dept2,2000
Dept1,2500
Dept2,1500
Dept1,1700
Dept2,2800

I implemented the following code:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf


object MaxSalary {
  val sc = new SparkContext(new SparkConf().setAppName("Max Salary").setMaster("local[2]"))

  case class Dept(dept_name : String, Salary : Int)

  val data = sc.textFile("file:///home/user/Documents/dept.txt").map(_.split(","))
  val recs = data.map(r => (r(0), Dept(r(0), r(1).toInt)))
  val a = recs.max() // ??? stuck here
}

But I'm stuck on how to implement the group-by and max step. I'm using a pair RDD.

Thanks.

phi*_*ert 5

This can be done with RDDs using the following code:

// Drop the header line (first row of partition 0), then build a (Dept, Salary) pair RDD
val emp = sc.textFile("file:///home/user/Documents/dept.txt")
            .mapPartitionsWithIndex( (idx, iter) => if (idx == 0) iter.drop(1) else iter )
            .map(x => (x.split(",")(0), x.split(",")(1).toInt))

// For each department key, keep the maximum salary
val maxSal = emp.reduceByKey(math.max(_, _))

which should give you:

Array[(String, Int)] = Array((Dept1,2500), (Dept2,2800))
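If you are on Spark 2.x, the same result can also be obtained with the DataFrame API, which reads the header row for you instead of dropping it by hand. A minimal sketch, assuming the same file path as in the question (the `MaxSalaryDF` object name is just for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

object MaxSalaryDF {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Max Salary")
      .master("local[2]")
      .getOrCreate()

    // header=true consumes the "Dept,Salary" line; inferSchema makes Salary numeric
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("file:///home/user/Documents/dept.txt")

    // Group by department and take the maximum salary in each group
    df.groupBy("Dept")
      .agg(max("Salary").as("MaxSalary"))
      .show()

    spark.stop()
  }
}
```

`reduceByKey` on the RDD and `groupBy(...).agg(max(...))` on the DataFrame compute the same per-key maximum; the DataFrame version additionally lets Spark's optimizer plan the aggregation.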