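I have a table in Postgres with the following structure:
Table path: passengers, origin, destination, day, month, year
I want to find the top 3 routes by the number of passengers who traveled on a route in a year. The total for a route (A <-> B) = total passengers (A -> B) + total passengers (B -> A).
What is the best/most efficient way to aggregate the passenger counts per route? The table has roughly 150 million rows.
Thanks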
There are two approaches to this. One is aggregation; the other is a join.
select least(origin, dest) as od1, greatest(origin, dest) as od2, sum(passengers) as numpassengers
from path t
group by least(origin, dest), greatest(origin, dest)
order by numpassengers desc
limit 3;
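To make the folding of both directions concrete, here is a minimal sketch that runs the same aggregation over a few hypothetical rows. The sample CTE and its values are made up for illustration and stand in for the real path table.

```sql
-- Hypothetical sample rows standing in for the path table.
with sample(origin, dest, passengers) as (
    values ('A', 'B', 10),
           ('B', 'A', 5),
           ('A', 'C', 7),
           ('C', 'B', 3)
)
-- least/greatest put both directions of a route under the same key.
select least(origin, dest) as od1,
       greatest(origin, dest) as od2,
       sum(passengers) as numpassengers
from sample
group by least(origin, dest), greatest(origin, dest)
order by numpassengers desc;
-- A <-> B totals 15 (10 + 5), A <-> C totals 7, B <-> C totals 3
```

Because A -> B and B -> A both map to the pair (A, B), a single pass over the table is enough and no join is needed.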
The other is a self-join. If there is only one row per direction, you can do this without aggregation:
select p1.origin, p1.dest, p1.passengers + p2.passengers as numpassengers
from path p1 join
     path p2
     on p1.origin = p2.dest and p1.dest = p2.origin
where p1.origin < p1.dest
order by numpassengers desc
limit 3;
Otherwise, you need both the self-join and aggregation, so the first approach is probably faster:
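-- Caution: with several rows per direction, each A -> B row pairs with every
-- B -> A row, so this sum can overcount; routes with one direction only are dropped.
select p1.origin, p1.dest, sum(p1.passengers + p2.passengers) as numpassengers
from path p1 join
     path p2
     on p1.origin = p2.dest and p1.dest = p2.origin
where p1.origin < p1.dest
group by p1.origin, p1.dest
order by numpassengers desc
limit 3;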
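I don't know which would be more efficient. However, I suspect that the top 3 routes by total will be among the top 100 in each direction. If so, build an index on passengers and then try: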
select least(origin, dest) as od1, greatest(origin, dest) as od2, sum(passengers) as numpassengers
from path t cross join
     (select min(passengers) as cutoff
      from (select distinct passengers
            from path
            order by passengers desc
            limit 100
           ) top100
     ) minp
-- filter on the row-level column; the numpassengers alias is not visible in where
where t.passengers >= minp.cutoff
group by least(origin, dest), greatest(origin, dest)
order by numpassengers desc
limit 3;
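The cutoff calculation should use only the index, and it should greatly reduce the load on the rest of the query.
EDIT:
If you don't have least() and greatest(), just use a case statement: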
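The index mentioned above might be created along these lines. This is only a sketch: the index name path_passengers_idx is hypothetical, and whether the planner actually answers the cutoff subquery from the index depends on the data and on table statistics, so it is worth checking the plan.

```sql
-- Hypothetical index name; the indexed column is the per-row passengers count.
create index path_passengers_idx on path (passengers);

-- Inspect the plan of the cutoff subquery to see whether the index is used.
explain
select distinct passengers
from path
order by passengers desc
limit 100;
```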
select (case when origin < dest then origin else dest end) as od1,
       (case when origin < dest then dest else origin end) as od2,
       sum(passengers) as numpassengers
from path t
group by 1, 2
order by numpassengers desc
limit 3;
You could repeat the case expressions in the group by. However, Amazon Redshift allows you to reference column aliases or positions in the group by clause.
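For a database that does not accept positions or aliases in group by, the variant with the case expressions repeated would look roughly like this (same path table and columns as in the queries above):

```sql
select (case when origin < dest then origin else dest end) as od1,
       (case when origin < dest then dest else origin end) as od2,
       sum(passengers) as numpassengers
from path t
-- Same expressions as in the select list, spelled out instead of "group by 1, 2".
group by (case when origin < dest then origin else dest end),
         (case when origin < dest then dest else origin end)
order by numpassengers desc
limit 3;
```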