Par*_*rma 2 sql dataframe apache-spark
I have an Apache Spark DataFrame containing the following data (ID, Name, DATE):
ID,Name,DATE
1,Anil,2000-06-02
1,Anil,2000-06-03
1,Anil,2000-06-04
2,Arun,2000-06-05
2,Arun,2000-06-06
2,Arun,2000-06-07
3,Anju,2000-06-08
3,Anju,2000-06-09
3,Anju,2000-06-10
4,Ram,2000-06-11
4,Ram,2000-06-02
4,Ram,2000-06-03
4,Ram,2000-06-04
5,Ramu,2000-06-05
5,Ramu,2000-06-06
5,Ramu,2000-06-07
5,Ramu,2000-06-08
6,Renu,2000-06-09
7,Gopu,2000-06-10
7,Gopu,2000-06-11
But I want only the two most recent records for each ID, so the output should look like this:
ID,Name,DATE
1,Anil,2000-06-03
1,Anil,2000-06-04
2,Arun,2000-06-06
2,Arun,2000-06-07
3,Anju,2000-06-09
3,Anju,2000-06-10
4,Ram,2000-06-03
4,Ram,2000-06-04
5,Ramu,2000-06-07
5,Ramu,2000-06-08
6,Renu,2000-06-09
7,Gopu,2000-06-10
7,Gopu,2000-06-11
Do I need to use a window function such as Lag?
Use a LEFT OUTER JOIN together with HAVING COUNT(*) < 2:
SELECT d.ID, d.Name, d.Date
FROM Dataframetable d
LEFT OUTER JOIN Dataframetable d2 ON d2.ID = d.ID AND d.Date < d2.Date
GROUP BY d.ID, d.Name, d.Date
HAVING COUNT(*) < 2
Output:
ID Name Date
1 Anil 2000-06-03T00:00:00Z
1 Anil 2000-06-04T00:00:00Z
2 Arun 2000-06-06T00:00:00Z
2 Arun 2000-06-07T00:00:00Z
3 Anju 2000-06-09T00:00:00Z
3 Anju 2000-06-10T00:00:00Z
4 Ram 2000-06-04T00:00:00Z
4 Ram 2000-06-11T00:00:00Z
5 Ramu 2000-06-07T00:00:00Z
5 Ramu 2000-06-08T00:00:00Z
6 Renu 2000-06-09T00:00:00Z
7 Gopu 2000-06-10T00:00:00Z
7 Gopu 2000-06-11T00:00:00Z
SQL Fiddle: http://sqlfiddle.com/#!6/8dcc2/1/0
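The reason COUNT(*) < 2 keeps the two latest rows may not be obvious. The join pairs every row with each later row of the same ID. The latest row has no later row, but the LEFT OUTER JOIN still emits it once (with NULLs on the d2 side), so its count is 1. The second-latest row matches exactly one later row, so its count is also 1. Every older row has a count of 2 or more. Below is a minimal sketch that prints these counts. It assumes Python's built-in sqlite3 module as a stand-in for Spark SQL and uses only IDs 4 and 6 from the sample data; the names `counts` and `top_two` are illustrative.

```python
import sqlite3

# Subset of the question's sample data (IDs 4 and 6 only), loaded into an
# in-memory SQLite database that stands in for the Spark table.
rows = [
    (4, "Ram", "2000-06-11"),
    (4, "Ram", "2000-06-02"),
    (4, "Ram", "2000-06-03"),
    (4, "Ram", "2000-06-04"),
    (6, "Renu", "2000-06-09"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Dataframetable (ID INTEGER, Name TEXT, Date TEXT)")
conn.executemany("INSERT INTO Dataframetable VALUES (?, ?, ?)", rows)

# Same join as in the answer, but without HAVING, so the count computed
# for every row is visible.
counts = conn.execute(
    """
    SELECT d.ID, d.Name, d.Date, COUNT(*) AS cnt
    FROM Dataframetable d
    LEFT OUTER JOIN Dataframetable d2 ON d2.ID = d.ID AND d.Date < d2.Date
    GROUP BY d.ID, d.Name, d.Date
    ORDER BY d.ID, d.Date
    """
).fetchall()

for row in counts:
    print(row)

# Applying the HAVING condition afterwards keeps the two latest rows per ID.
top_two = [(i, name, date) for (i, name, date, cnt) in counts if cnt < 2]
print(top_two)
```

Note that this relies on dates being distinct within an ID: if two rows of one ID share the same date, neither counts as later than the other, and more than two rows can pass the filter.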
Alternatively, use subqueries instead of a self-join:
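SELECT ID, Name, Date
FROM (
    -- Latest date for each ID
    SELECT d.ID, d.Name, MAX(d.Date) AS Date
    FROM Dataframetable d
    GROUP BY d.ID, d.Name
    UNION ALL
    -- Second-latest date for each ID: the maximum among the rows that are
    -- strictly older than that ID's latest date. The subquery is correlated
    -- on ID so that another ID's latest date cannot exclude this ID's rows.
    SELECT d.ID, d.Name, MAX(d.Date) AS Date
    FROM Dataframetable d
    WHERE d.Date < (SELECT MAX(d2.Date)
                    FROM Dataframetable d2
                    WHERE d2.ID = d.ID)
    GROUP BY d.ID, d.Name
) b
ORDER BY ID, Date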
SQL Fiddle: http://sqlfiddle.com/#!6/8dcc2/19/0
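On the question of window functions: Lag is not required, but a ranking function is another way to get the same rows. The sketch below numbers the rows of each ID from newest to oldest with ROW_NUMBER() and keeps the first two. Spark SQL supports ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...), so the query text should carry over. Here it is run through Python's built-in sqlite3 module purely as a stand-in, which assumes SQLite 3.25 or newer for window-function support. The alias `rn` and the variable `latest_two` are illustrative names.

```python
import sqlite3

# Full sample data from the question.
rows = [
    (1, "Anil", "2000-06-02"), (1, "Anil", "2000-06-03"), (1, "Anil", "2000-06-04"),
    (2, "Arun", "2000-06-05"), (2, "Arun", "2000-06-06"), (2, "Arun", "2000-06-07"),
    (3, "Anju", "2000-06-08"), (3, "Anju", "2000-06-09"), (3, "Anju", "2000-06-10"),
    (4, "Ram", "2000-06-11"), (4, "Ram", "2000-06-02"),
    (4, "Ram", "2000-06-03"), (4, "Ram", "2000-06-04"),
    (5, "Ramu", "2000-06-05"), (5, "Ramu", "2000-06-06"),
    (5, "Ramu", "2000-06-07"), (5, "Ramu", "2000-06-08"),
    (6, "Renu", "2000-06-09"),
    (7, "Gopu", "2000-06-10"), (7, "Gopu", "2000-06-11"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Dataframetable (ID INTEGER, Name TEXT, Date TEXT)")
conn.executemany("INSERT INTO Dataframetable VALUES (?, ?, ?)", rows)

# Number the rows of each ID from newest (rn = 1) to oldest,
# then keep only the two newest rows per ID.
query = """
SELECT ID, Name, Date
FROM (
    SELECT ID, Name, Date,
           ROW_NUMBER() OVER (PARTITION BY ID ORDER BY Date DESC) AS rn
    FROM Dataframetable
) ranked
WHERE rn <= 2
ORDER BY ID, Date
"""
latest_two = conn.execute(query).fetchall()

for row in latest_two:
    print(row)
```

Unlike the self-join, this returns at most two rows per ID even when dates repeat within an ID, although which of the tied rows is kept is then arbitrary.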