HiveQL and rank()

Jef*_*lor 2 hadoop hive hiveql

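I can't make sense of rank() in HiveQL. I found several rank UDF implementations on the web, such as Edward's nice example. I can load and call these functions, but I can't get them to do what I want. Here is a detailed example:

Loading the UDF into the CLI process: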

$ javac -classpath /home/hadoop/hadoop/hadoop-core-1.0.4.jar:/home/hadoop/hive/lib/hive-exec-0.10.0.jar com/m6d/hiveudf/Rank2.java 
$ jar -cvf Rank2.jar com/m6d/hiveudf/Rank2.class
hive> ADD JAR /home/hadoop/MyDemo/Rank2.jar;
hive> CREATE TEMPORARY FUNCTION Rank2 AS 'com.m6d.hiveudf.Rank2'; 

Creating a table:

create table purchases (
  SalesRepId String, 
  PurchaseOrderId INT, 
  Amount INT
) 
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n';

Loading data from this CSV:

Jana,1,100
Nadia,2,200
Nadia,3,600
Daniel,4,80
Jana,5,120
William,6,170
Daniel,7,140

using this command from the CLI:

LOAD DATA 
  LOCAL INPATH '/home/hadoop/MyDemo/purchases.csv'
  INTO TABLE purchases;

Now I can see my top sales reps:

select SalesRepId,sum(amount) as volume
from purchases
group by SalesRepId
ORDER BY volume DESC;

Nadia sold $800 worth of goods, Daniel and Jana each sold $220, and William sold $170:

SalesRep    Amount
--------    ------
Nadia       800
Daniel      220
Jana        220
William     170

Now all I want is to number them: Nadia ranks #1, Daniel and Jana tie at #2, and William is #4 (not #3):

select SalesRepId, V.volume,rank2(V.volume)
from 
(select SalesRepId,sum(amount) as volume
from purchases
group by SalesRepId
ORDER BY volume DESC) V;

This is what I get, which is not what I want:

SalesRep   Amount  Rank
--------   ------  ----
Nadia       800      1
Daniel      220      1
Jana        220      2
William     170      1
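One possible reading of this output, offered as an assumption rather than a statement about the actual Rank2 source: the UDF keeps a counter that restarts whenever its argument changes, so it numbers rows within each run of equal values instead of ranking across the whole result. The plain-Java sketch below (the class `PerKeyCounterSketch` and its methods are hypothetical, and no Hive classes are involved) reproduces the 1, 1, 2, 1 column under that assumption.

```java
import java.util.Arrays;

// Hypothetical stand-in for a counter-style rank UDF; NOT the real Rank2 code.
final class PerKeyCounterSketch {
    private long counter = 0;        // position inside the current run of equal keys
    private Long previousKey = null; // last key seen, null before the first row

    // Assumed behaviour: restart the count whenever the key changes.
    long evaluate(long currentKey) {
        if (previousKey == null || previousKey.longValue() != currentKey) {
            counter = 0;
            previousKey = currentKey;
        }
        return ++counter;
    }

    // Feeds the keys to a fresh instance in the given row order.
    static long[] ranks(long[] keys) {
        PerKeyCounterSketch udf = new PerKeyCounterSketch();
        long[] out = new long[keys.length];
        for (int i = 0; i < keys.length; i++) {
            out[i] = udf.evaluate(keys[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        long[] volumes = {800, 220, 220, 170}; // Nadia, Daniel, Jana, William
        System.out.println(Arrays.toString(ranks(volumes))); // prints [1, 1, 2, 1]
    }
}
```

If that reading is right, passing the volume as the key makes every distinct volume its own group, which is why only the tied pair ever gets past 1.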

This is what I want, but I can't get Hive to do it for me:

SalesRep   Amount  Rank
--------   ------  ----
Nadia       800      1
Daniel      220      2
Jana        220      2
William     170      4

Can you help me rank the sales reps with the correct HiveQL?

Thanks to JtheRocker for the response. His change produced this listing:

SalesRep   Amount  Rank
--------   ------  ----
William     170     1
Daniel      220     2
Jana        220     2
Nadia       800     3

A slight modification shows Nadia as #4 (not #3):

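// Rows seen so far. Declared as long on the assumption that counter is a long too.
private long row_number;

@Override
public Object evaluate(DeferredObject[] currentKey) throws HiveException {
  row_number++;
  // On a new key, jump the rank to the current row number so that ties leave a gap.
  if (!sameAsPreviousKey(currentKey)) {
    this.counter = row_number;
    copyToPreviousKey(currentKey);
  }
  return Long.valueOf(this.counter);
}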
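To check the logic outside Hive, here is a minimal plain-Java sketch of the same idea. The class `CompetitionRankSketch` is hypothetical and uses plain `long` keys in place of `DeferredObject[]`; it only mirrors the row-number trick above and assumes the rows arrive already sorted by the key.

```java
import java.util.Arrays;

// Hypothetical, Hive-free model of the modified evaluate() method.
final class CompetitionRankSketch {
    private long rowNumber = 0;      // rows seen so far
    private long counter = 0;        // rank shared by the current run of equal keys
    private Long previousKey = null; // last key seen, null before the first row

    long evaluate(long currentKey) {
        rowNumber++;
        // A new key takes the current row number as its rank; ties keep the old rank.
        if (previousKey == null || previousKey.longValue() != currentKey) {
            counter = rowNumber;
            previousKey = currentKey;
        }
        return counter;
    }

    // Ranks keys that are assumed to be sorted already (ascending or descending).
    static long[] ranks(long[] sortedKeys) {
        CompetitionRankSketch udf = new CompetitionRankSketch();
        long[] out = new long[sortedKeys.length];
        for (int i = 0; i < sortedKeys.length; i++) {
            out[i] = udf.evaluate(sortedKeys[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        long[] ascending = {170, 220, 220, 800}; // William, Daniel, Jana, Nadia
        System.out.println(Arrays.toString(ranks(ascending))); // prints [1, 2, 2, 4]
    }
}
```

Because the rank is taken from the row number, the result depends entirely on the order in which rows reach the function; feeding the volumes in descending order would give Nadia 1 and William 4 instead.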

lib*_*ack 7

With the windowing and analytics functions introduced in Hive 0.11, you can use:

select SalesRepId, volume as amount , rank() over (order by V.volume desc) as rank from 
(select SalesRepId,sum(amount) as volume from purchases group by SalesRepId) V;
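For anyone who wants to sanity-check the expected result without a cluster, the following plain-Java sketch walks through what the query is meant to compute on the sample CSV: sum per rep, sort by volume descending, then assign ranks with gaps after ties. The class `WindowRankSketch` and the method `rankReps` are made up for illustration, and ties are ordered by name here only to keep the output deterministic; Hive makes no such promise.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of "sum ... group by" followed by rank() over (order by ... desc).
final class WindowRankSketch {
    static Map<String, Integer> rankReps(List<String> csvLines) {
        // Inner query: total amount per SalesRepId (TreeMap keeps names sorted).
        Map<String, Integer> volume = new TreeMap<>();
        for (String line : csvLines) {
            String[] fields = line.split(",");
            volume.merge(fields[0], Integer.parseInt(fields[2]), Integer::sum);
        }

        // Order by volume descending; the stable sort leaves ties in name order.
        List<Map.Entry<String, Integer>> rows = new ArrayList<>(volume.entrySet());
        rows.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));

        // Tied rows share a rank, and the next distinct value skips ahead.
        Map<String, Integer> ranks = new LinkedHashMap<>();
        int rank = 0;
        Integer previous = null;
        for (int i = 0; i < rows.size(); i++) {
            int v = rows.get(i).getValue();
            if (previous == null || previous.intValue() != v) {
                rank = i + 1;
                previous = v;
            }
            ranks.put(rows.get(i).getKey(), rank);
        }
        return ranks;
    }

    public static void main(String[] args) {
        List<String> csv = Arrays.asList(
            "Jana,1,100", "Nadia,2,200", "Nadia,3,600", "Daniel,4,80",
            "Jana,5,120", "William,6,170", "Daniel,7,140");
        System.out.println(rankReps(csv)); // prints {Nadia=1, Daniel=2, Jana=2, William=4}
    }
}
```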