The two most recent records in a DataFrame

Par*_*rma 2 sql dataframe apache-spark

I have an Apache Spark DataFrame containing the following data (ID, Name, DATE):

ID,Name,DATE
1,Anil,2000-06-02
1,Anil,2000-06-03
1,Anil,2000-06-04
2,Arun,2000-06-05
2,Arun,2000-06-06
2,Arun,2000-06-07
3,Anju,2000-06-08
3,Anju,2000-06-09
3,Anju,2000-06-10
4,Ram,2000-06-11
4,Ram,2000-06-02
4,Ram,2000-06-03
4,Ram,2000-06-04
5,Ramu,2000-06-05
5,Ramu,2000-06-06
5,Ramu,2000-06-07
5,Ramu,2000-06-08
6,Renu,2000-06-09
7,Gopu,2000-06-10
7,Gopu,2000-06-11

But I want the two most recent records for each ID, so I would like to get the following output:

ID,Name,DATE
1,Anil,2000-06-03
1,Anil,2000-06-04
2,Arun,2000-06-06
2,Arun,2000-06-07
3,Anju,2000-06-09
3,Anju,2000-06-10
4,Ram,2000-06-03
4,Ram,2000-06-04
5,Ramu,2000-06-07
5,Ramu,2000-06-08
6,Renu,2000-06-09
7,Gopu,2000-06-10
7,Gopu,2000-06-11

Do I need to use a window function such as Lag?

Mat*_*att 5

Use a LEFT OUTER JOIN with HAVING COUNT(*) < 2.

SELECT d.ID, d.Name, d.Date
FROM Dataframetable d
LEFT OUTER JOIN Dataframetable d2 ON d2.ID = d.ID AND d.Date < d2.Date
GROUP BY d.ID, d.Name, d.Date
HAVING COUNT(*) < 2

Output

ID  Name    Date
1   Anil    2000-06-03T00:00:00Z
1   Anil    2000-06-04T00:00:00Z
2   Arun    2000-06-06T00:00:00Z
2   Arun    2000-06-07T00:00:00Z
3   Anju    2000-06-09T00:00:00Z
3   Anju    2000-06-10T00:00:00Z
4   Ram     2000-06-04T00:00:00Z
4   Ram     2000-06-11T00:00:00Z
5   Ramu    2000-06-07T00:00:00Z
5   Ramu    2000-06-08T00:00:00Z
6   Renu    2000-06-09T00:00:00Z
7   Gopu    2000-06-10T00:00:00Z
7   Gopu    2000-06-11T00:00:00Z

SQL Fiddle: http://sqlfiddle.com/#!6/8dcc2/1/0
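Why HAVING COUNT(*) < 2 keeps exactly the two latest rows is easy to miss, so here is a minimal sketch that prints the per-row counts for ID 4 from the question. It runs the same join through Python's built-in sqlite3 module purely as a convenient stand-in (an assumption, this is not Spark), and the column aliases later_rows and joined_rows are made up for illustration.

```python
import sqlite3

# Rows for ID 4 from the question; ISO date strings sort in date order.
rows = [
    (4, "Ram", "2000-06-11"),
    (4, "Ram", "2000-06-02"),
    (4, "Ram", "2000-06-03"),
    (4, "Ram", "2000-06-04"),
]

conn = sqlite3.connect(":memory:")  # stand-in engine, not Spark
conn.execute("CREATE TABLE Dataframetable (ID INTEGER, Name TEXT, Date TEXT)")
conn.executemany("INSERT INTO Dataframetable VALUES (?, ?, ?)", rows)

# Same join as the answer, but without HAVING, so the counts are visible.
# COUNT(d2.ID) counts only real matches (rows with a later date);
# COUNT(*) also counts the NULL-padded row the outer join emits
# for the latest record, which has no later row to match.
counts = conn.execute("""
    SELECT d.Date, COUNT(d2.ID) AS later_rows, COUNT(*) AS joined_rows
    FROM Dataframetable d
    LEFT OUTER JOIN Dataframetable d2 ON d2.ID = d.ID AND d.Date < d2.Date
    GROUP BY d.ID, d.Name, d.Date
    ORDER BY d.Date DESC
""").fetchall()

for row in counts:
    print(row)
```

The latest row has no later match but still yields one joined row, and the second-latest row has exactly one later match, so both have COUNT(*) = 1. Every older row has two or more matches and is filtered out.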

Using subqueries instead of a self-join:

SELECT ID, Name, Date
FROM (
    -- Latest date per ID
    SELECT d.ID, d.Name, MAX(d.Date) AS Date
    FROM Dataframetable d
    GROUP BY d.ID, d.Name
    UNION ALL
    -- Second-latest date per ID: the max among rows older than that ID's latest
    SELECT d.ID, d.Name, MAX(d.Date) AS Date
    FROM Dataframetable d
    WHERE d.Date < (SELECT MAX(d2.Date)
                    FROM Dataframetable d2
                    WHERE d2.ID = d.ID)
    GROUP BY d.ID, d.Name
) b
ORDER BY ID, Date

SQL Fiddle: http://sqlfiddle.com/#!6/8dcc2/19/0
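On the question of window functions: neither query above needs one, but a ranking function such as ROW_NUMBER() (rather than Lag) is another common way to express "top two per ID". The sketch below is not part of the original answer. It runs the query through Python's built-in sqlite3 module as a stand-in for Spark, which assumes the bundled SQLite is version 3.25 or newer (the first release with window functions). Spark SQL accepts the same ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) syntax. The alias names ranked and rn are illustrative.

```python
import sqlite3

# Sample data from the question; ISO date strings sort in date order.
rows = [
    (1, "Anil", "2000-06-02"), (1, "Anil", "2000-06-03"), (1, "Anil", "2000-06-04"),
    (2, "Arun", "2000-06-05"), (2, "Arun", "2000-06-06"), (2, "Arun", "2000-06-07"),
    (3, "Anju", "2000-06-08"), (3, "Anju", "2000-06-09"), (3, "Anju", "2000-06-10"),
    (4, "Ram", "2000-06-11"), (4, "Ram", "2000-06-02"),
    (4, "Ram", "2000-06-03"), (4, "Ram", "2000-06-04"),
    (5, "Ramu", "2000-06-05"), (5, "Ramu", "2000-06-06"),
    (5, "Ramu", "2000-06-07"), (5, "Ramu", "2000-06-08"),
    (6, "Renu", "2000-06-09"),
    (7, "Gopu", "2000-06-10"), (7, "Gopu", "2000-06-11"),
]

conn = sqlite3.connect(":memory:")  # stand-in engine, not Spark
conn.execute("CREATE TABLE Dataframetable (ID INTEGER, Name TEXT, Date TEXT)")
conn.executemany("INSERT INTO Dataframetable VALUES (?, ?, ?)", rows)

# Number each ID's rows from newest to oldest, then keep ranks 1 and 2.
latest_two = conn.execute("""
    SELECT ID, Name, Date
    FROM (
        SELECT ID, Name, Date,
               ROW_NUMBER() OVER (PARTITION BY ID ORDER BY Date DESC) AS rn
        FROM Dataframetable
    ) ranked
    WHERE rn <= 2
    ORDER BY ID, Date
""").fetchall()

for row in latest_two:
    print(row)
```

With this data the sketch returns 13 rows, including 2000-06-04 and 2000-06-11 for ID 4 and the single record for ID 6. If two rows for one ID share a date, ROW_NUMBER() breaks the tie arbitrarily, so adding a tiebreaker column to the ORDER BY would be a reasonable refinement.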

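  • @PardeepSharma I've added an answer that doesn't need a self-join, though I'd guess it will be slower. (2 upvotes)
  • Not necessarily; I'm only assuming that. I don't know your data, so run both queries with an explain plan and see for yourself. (2 upvotes)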