How do we rank a dataframe?

use*_*666 8 scala apache-spark apache-spark-sql

I have the following sample dataframe:

I/P

accountNumber   assetValue  
A100            1000         
A100            500          
B100            600          
B100            200          

O/P

accountNumber   assetValue  Rank
A100            1000         1
A100            500          2
B100            600          1
B100            200          2

Now my question is: how do we add this rank column to the dataframe, ranked within each account number? I don't expect a large number of rows, so I'm open to doing this outside the dataframe if necessary.

I am using Spark version 1.5 with SQLContext, so I cannot use window functions.
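Since the question leaves open doing the ranking outside the dataframe, here is a minimal plain-Scala sketch (no window functions, no Spark at all) under that assumption: group the rows by account, sort each group by value descending, and number the rows. All names here are illustrative.

```scala
// Plain-Scala ranking sketch: group by accountNumber, sort each group
// by assetValue descending, and assign 1-based ranks within the group.
val rows = Seq(("A100", 1000), ("A100", 500), ("B100", 600), ("B100", 200))

val ranked = rows
  .groupBy(_._1)                       // group rows by accountNumber
  .toSeq
  .flatMap { case (acct, group) =>
    group
      .sortBy(-_._2)                   // sort each group by assetValue desc
      .zipWithIndex
      .map { case ((_, value), i) => (acct, value, i + 1) }
  }
// ranked holds (accountNumber, assetValue, rank) triples, e.g.
// ("A100", 1000, 1) and ("B100", 200, 2); group order is unspecified.
```

This is only practical for small data, as the original question anticipates, since it collects everything onto one machine.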

Psi*_*dom 7

You can use the row_number function with a Window expression, specifying the partition and order columns:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import sqlContext.implicits._   // for toDF and the $ column syntax
                                // (already in scope in spark-shell)

val df = Seq(("A100", 1000), ("A100", 500), ("B100", 600), ("B100", 200))
  .toDF("accountNumber", "assetValue")

val byAccount = Window.partitionBy($"accountNumber").orderBy($"assetValue".desc)
df.withColumn("rank", row_number().over(byAccount)).show

+-------------+----------+----+
|accountNumber|assetValue|rank|
+-------------+----------+----+
|         A100|      1000|   1|
|         A100|       500|   2|
|         B100|       600|   1|
|         B100|       200|   2|
+-------------+----------+----+
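One caveat about the question's stated constraint: on Spark 1.4/1.5, window functions such as row_number were only supported through a HiveContext, not a plain SQLContext (newer Spark versions no longer have this restriction). A sketch of the setup, assuming a spark-shell style `sc` is available:

```scala
import org.apache.spark.sql.hive.HiveContext

// On Spark 1.5, window functions require a HiveContext
// rather than a plain SQLContext.
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
```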


Nay*_*rma 7

Raw SQL:

val df = sc.parallelize(Seq(
  ("A100", 1000), ("A100", 500), ("B100", 600), ("B100", 200)
)).toDF("accountNumber", "assetValue")

df.registerTempTable("df")
sqlContext.sql(
  """SELECT accountNumber, assetValue,
    |       RANK() OVER (PARTITION BY accountNumber ORDER BY assetValue DESC) AS rank
    |FROM df""".stripMargin).show


+-------------+----------+----+
|accountNumber|assetValue|rank|
+-------------+----------+----+
|         A100|      1000|   1|
|         A100|       500|   2|
|         B100|       600|   1|
|         B100|       200|   2|
+-------------+----------+----+
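Note that this answer uses RANK() while the previous one uses row_number(); on the sample data (no duplicate values per account) they produce identical output, but they diverge on ties. A small plain-Scala illustration of the two semantics, using hypothetical tied values (two rows sharing 600 in one account):

```scala
// Hypothetical tied values within one account: two rows share 600.
val values = Seq(600, 600, 200)

// RANK gives tied rows the same rank and leaves a gap afterwards: 1, 1, 3
val rank = values.map(v => values.count(_ > v) + 1)

// row_number numbers rows consecutively, breaking ties arbitrarily: 1, 2, 3
val rowNumber = values.indices.map(_ + 1)
```

Use RANK() when tied values should share a rank, and row_number() when every row must get a distinct number.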