RaA*_*aAm 7 sql scala window-functions apache-spark apache-spark-sql
我正在使用Window函数在Spark中实现累积和.但是在应用窗口分区功能时不保持记录输入的顺序
输入数据:
val base = List(List("10", "MILLER", "1300", "2017-11-03"), List("10", "Clark", "2450", "2017-12-9"), List("10", "King", "5000", "2018-01-28"),
List("30", "James", "950", "2017-10-18"), List("30", "Martin", "1250", "2017-11-21"), List("30", "Ward", "1250", "2018-02-05"))
.map(row => (row(0), row(1), row(2), row(3)))
val DS1 = base.toDF("dept_no", "emp_name", "sal", "date")
DS1.show()
Run Code Online (Sandbox Code Playgroud)
+-------+--------+----+----------+
|dept_no|emp_name| sal| date|
+-------+--------+----+----------+
| 10| MILLER|1300|2017-11-03|
| 10| Clark|2450| 2017-12-9|
| 10| King|5000|2018-01-28|
| 30| James| 950|2017-10-18|
| 30| Martin|1250|2017-11-21|
| 30| Ward|1250|2018-02-05|
+-------+--------+----+----------+
Run Code Online (Sandbox Code Playgroud)
预期产出:
+-------+--------+----+----------+-----------+
|dept_no|emp_name| sal| date|Dept_CumSal|
+-------+--------+----+----------+-----------+
| 10| MILLER|1300|2017-11-03| 1300.0|
| 10| Clark|2450| 2017-12-9| 3750.0|
| 10| King|5000|2018-01-28| 8750.0|
| 30| James| 950|2017-10-18| 950.0|
| 30| Martin|1250|2017-11-21| 2200.0|
| 30| Ward|1250|2018-02-05| 3450.0|
+-------+--------+----+----------+-----------+
Run Code Online (Sandbox Code Playgroud)
我尝试过以下逻辑
val baseDepCumSal = DS1.withColumn("Dept_CumSal", sum("sal").over(Window.partitionBy("dept_no").
orderBy(col("sal"), col("emp_name"), col("date").asc).
rowsBetween(Long.MinValue, 0)
))
baseDepCumSal.orderBy("dept_no", "date").show
Run Code Online (Sandbox Code Playgroud)
+-------+--------+----+----------+-----------+
|dept_no|emp_name| sal| date|Dept_CumSal|
+-------+--------+----+----------+-----------+
| 10| MILLER|1300|2017-11-03| 1300.0|
| 10| Clark|2450| 2017-12-9| 3750.0|
| 10| King|5000|2018-01-28| 8750.0|
| 30| James| 950|2017-10-18| 3450.0|
| 30| Martin|1250|2017-11-21| 1250.0|
| 30| Ward|1250|2018-02-05| 2500.0|
+-------+--------+----+----------+-----------+
Run Code Online (Sandbox Code Playgroud)
对于dept_no = 10,记录按预期顺序计算,而对于dept_no = 30,记录不按输入顺序计算.
发生这种情况是由于类型不正确。因为薪水是string
DS1.printSchema
root
|-- dept_no: string (nullable = true)
|-- emp_name: string (nullable = true)
|-- sal: string (nullable = true)
|-- date: string (nullable = true)
Run Code Online (Sandbox Code Playgroud)
它按字典顺序排序:
DS1.orderBy("sal").show
+-------+--------+----+----------+
|dept_no|emp_name| sal| date|
+-------+--------+----+----------+
| 30| Martin|1250|2017-11-21|
| 30| Ward|1250|2018-02-05|
| 10| MILLER|1300|2017-11-03|
| 10| Clark|2450| 2017-12-9|
| 10| King|5000|2018-01-28|
| 30| James| 950|2017-10-18|
+-------+--------+----+----------+
Run Code Online (Sandbox Code Playgroud)
为了获得理想的结果,您必须进行强制转换(并且不需要定义框架):
DS1.withColumn("Dept_CumSal", sum("sal").over(
Window
.partitionBy("dept_no")
.orderBy(col("sal").cast("integer"), col("emp_name"), col("date").asc))).show
+-------+--------+----+----------+-----------+
|dept_no|emp_name| sal| date|Dept_CumSal|
+-------+--------+----+----------+-----------+
| 30| James| 950|2017-10-18| 950.0|
| 30| Martin|1250|2017-11-21| 2200.0|
| 30| Ward|1250|2018-02-05| 3450.0|
| 10| MILLER|1300|2017-11-03| 1300.0|
| 10| Clark|2450| 2017-12-9| 3750.0|
| 10| King|5000|2018-01-28| 8750.0|
+-------+--------+----+----------+-----------+
Run Code Online (Sandbox Code Playgroud)
请注意,窗口内的顺序(col("sal"), col("emp_name"), col("date").asc)
与显示的顺序不同。"dept_no", "date"
为什么窗口中需要“sal”和“emp_name”?为什么不直接按日期订购?