I am looking for ideas on how to speed up a query that touches just two tables. I have already added indexes and run VACUUM.
The tables:
CREATE TABLE vintage_unit
(
id integer NOT NULL,
vintage_unit_date date,
origination_month timestamp with time zone,
product character varying,
customer_type character varying,
balance numeric,
CONSTRAINT pk_id_loan PRIMARY KEY (id)
);
CREATE INDEX id_vintage_unit_date ON vintage_unit USING btree (vintage_unit_date);
CREATE INDEX ind_customer_type ON vintage_unit USING btree (customer_type COLLATE pg_catalog."default");
CREATE INDEX ind_origination_month ON vintage_unit USING btree (origination_month);
CREATE INDEX ind_product ON vintage_unit USING btree (product COLLATE pg_catalog."default");
CREATE TABLE performance_event
(
id integer,
event_date date
);
CREATE INDEX ind_event_date ON performance_event USING btree (event_date);
CREATE INDEX ind_id ON performance_event USING btree (id);
Each of the two tables holds 10,000 rows.
Explain plan: http://explain.depesz.com/s/2di
Query: http://sqlfiddle.com/#!1/edece/1
PostgreSQL 9.1.9
Note: the query uses the function time_distance:
CREATE OR REPLACE FUNCTION time_distance(to_date date, from_date date, granularity character varying)
RETURNS integer AS
$BODY$
declare
distance int;
begin
if granularity = 'month' then
distance := (extract(months from age(date_trunc('month',to_date), date_trunc('month',from_date))))::int + 12*(extract(years from age(date_trunc('month',to_date), date_trunc('month',from_date))))::int;
elsif granularity = 'quarter' then
distance := trunc(((extract(months from age(date_trunc('quarter',to_date), date_trunc('quarter',from_date)))) + 12*(extract(years from age(date_trunc('quarter',to_date), date_trunc('quarter',from_date)))))/3)::integer;
elsif granularity = 'year' then
distance := ((trunc(extract(years from age(date_trunc('month',to_date), date_trunc('month',from_date)))) ))::integer;
else
distance:= -1;
end if;
return distance;
end;
$BODY$
LANGUAGE plpgsql IMMUTABLE
COST 100;
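For readers who want to check the period arithmetic, the PL/pgSQL above can be paraphrased in Python. This is a rough sketch, assuming to_date is not earlier than from_date, so that floor division matches the truncation behaviour of age():

```python
from datetime import date

def time_distance(to_date: date, from_date: date, granularity: str) -> int:
    """Whole calendar periods between the period containing from_date and
    the period containing to_date (mirrors the PL/pgSQL function for
    to_date >= from_date)."""
    # months between the two dates after date_trunc('month', ...)
    months = (to_date.year - from_date.year) * 12 + (to_date.month - from_date.month)
    if granularity == 'month':
        return months
    if granularity == 'quarter':
        # date_trunc('quarter', ...): compare quarter indexes directly
        return (to_date.year * 4 + (to_date.month - 1) // 3) - \
               (from_date.year * 4 + (from_date.month - 1) // 3)
    if granularity == 'year':
        return months // 12  # whole years between month-truncated dates
    return -1  # unknown granularity, as in the ELSE branch
```

For example, September and November of the same year are 2 months but 0 years apart, which is exactly what the month-truncated age() arithmetic produces.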
The query:
with vintage_descriptors as (
select
origination_month,product,customer_type,balance ,
vintage_unit_date,
generate_series(date_trunc('month',vintage_unit_date),max_vintage_date,'1 month')::date as event_date,
time_distance(generate_series(date_trunc('month',vintage_unit_date),max_vintage_date,'1 month')::date, vintage_unit_date,'month') as distance
from(
select
origination_month,product,customer_type,balance ,
vintage_unit_date, max_vintage_date
from
vintage_unit,
(select max(event_date) as max_vintage_date from performance_event) x
group by
origination_month,product,customer_type,balance ,
vintage_unit_date,
max_vintage_date
) a
)
,vintage_unit_sums as(
select
origination_month,product,customer_type,balance ,
vintage_unit_date,
sum(1::int) as vintage_unit_weight,
count(*) as vintage_unit_count
from
vintage_unit
group by
origination_month,product,customer_type,balance ,
vintage_unit_date
)
,performance_event_sums as(
select
origination_month,product,customer_type,balance ,
vintage_unit_date,
date_trunc('month',event_date)::date as event_date,
sum(1::int) as event_weight
from
vintage_unit
join performance_event ve using (id)
group by
origination_month,product,customer_type,balance ,
vintage_unit_date,
date_trunc('month',event_date)::date
)
,vintage_csums as (
select
vd.*,
vs.event_weight,
sum(coalesce(event_weight,0))
over(partition by origination_month,product,customer_type,balance ,vintage_unit_date order by event_date) as event_weight_csum
from
vintage_descriptors vd
left join performance_event_sums vs using (origination_month,product,customer_type,balance ,vintage_unit_date,event_date)
)
,aggregation as (
select
origination_month,product,customer_type,balance ,
vd.distance,
sum(vintage_unit_weight) vintage_unit_weight,
sum(vintage_unit_count) vintage_unit_count,
sum(coalesce(event_weight,0)) as event_weight,
sum(coalesce(event_weight,0)) / sum(vintage_unit_weight) as event_weight_pct,
sum(coalesce(event_weight_csum,0)) as event_weight_csum,
sum(coalesce(event_weight_csum,0))/sum(coalesce(vintage_unit_weight,0)) as event_weight_csum_pct
from
vintage_descriptors vd
join vintage_unit_sums using (origination_month,product,customer_type,balance , vintage_unit_date)
left join vintage_csums using (origination_month,product,customer_type,balance , vintage_unit_date, event_date)
group by
origination_month,product,customer_type,balance ,
vd.distance
order by
origination_month,product,customer_type,balance ,
distance
)
select * , row_number() over(partition by origination_month,product,customer_type , distance order by balance) as rn from aggregation
Query explanation: the purpose of this query is to compute data for a vintage analysis. It typically operates on the data structure described below.
The query I use here is in fact created dynamically, based on user input.
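As an illustration of that dynamic assembly (build_group_by and the dimension whitelist are hypothetical names, not part of my actual R code), the grouping list could be built like this:

```python
def build_group_by(dimensions):
    """Hypothetical sketch: assemble the GROUP BY list from user input.
    A whitelist keeps user-supplied names from injecting arbitrary SQL."""
    allowed = ('origination_month', 'product', 'customer_type', 'balance')
    chosen = [d for d in dimensions if d in allowed]
    return ', '.join(chosen + ['vintage_unit_date'])

# splice the same column list into both the SELECT and the GROUP BY
query = "select {cols}, count(*) from vintage_unit group by {cols}".format(
    cols=build_group_by(['product', 'customer_type']))
```

The generated text then replaces the fixed origination_month, product, customer_type, balance list used throughout the query above.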
Process: origination_month, product, customer_type and balance together define the groups I want to observe (the per-group statistics are computed in the CTEs vintage_unit_sums and performance_event_sums). For each group and vintage_unit_date I compute, per distance (the distance between vintage_unit_date and event_date, calculated with the time_distance function), the number of events and the number of units (rows) in the group, counting only those units for which the event could already have happened in the past; the last available data point is taken to be the maximum vintage_unit_date of the whole data set. (Example: if the vintage_unit_date of a given group is 2013-10-01 and the maximum vintage_unit_date is 2013-11-01, the maximum available distance is 1. But when the same group also contains rows with vintage_unit_date 2013-09-01, their maximum distance is 2. When computing the ratio mentioned above for distance 2, I do not want to include rows whose maximum distance is less than 2.) vintage_descriptors is used to generate a placeholder for every possible combination of available group and potential distance (because not every group has an event at every distance). About a year ago I thought this could be solved with OLAP and asked a question about it on StackOverflow. Since then I have built my own solution in R, where the SQL is created dynamically from user input and sent to PostgreSQL.
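The placeholder generation in vintage_descriptors can be mimicked in Python (a sketch only; month_starts stands in for generate_series with a '1 month' step, and the distance is the month index as produced by time_distance):

```python
from datetime import date

def month_starts(first: date, last: date):
    """Like generate_series(date_trunc('month', first), last, '1 month')."""
    y, m = first.year, first.month
    series = []
    while (y, m) <= (last.year, last.month):
        series.append(date(y, m, 1))
        m += 1
        if m > 12:
            y, m = y + 1, 1
    return series

def descriptors(vintage_unit_date: date, max_vintage_date: date):
    """One placeholder per (event_date, distance) pair, as in vintage_descriptors,
    so that distances with no events still appear in the output."""
    return [(d, i) for i, d in
            enumerate(month_starts(vintage_unit_date, max_vintage_date))]
```

A unit dated 2013-09-01 with a maximum date of 2013-11-01 thus gets placeholders at distances 0, 1 and 2, even if no event ever occurred at one of those distances.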
It looks like raising work_mem by a few MB (e.g. to 30) speeds things up considerably. You can also drop the ORDER BY clause from the aggregation CTE. Similarly, the GROUP BY in the sub-query of the first CTE looks useless. Also, sum(1) is basically a count over a non-null column, such as the column used in the inner join. Furthermore, it looks like you carry a bunch of non-aggregated, non-grouped columns through the CTEs; it is usually a good idea to join these in at the latest possible step, so that you reduce the memory consumption of the intermediate steps. I mean, for example, the vd.* in vintage_csums.
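To spell out the sum(1) point with a toy Python model (not the actual query): on an inner join the join column can never be NULL, so sum(1), count(*) and count(id) all return the same number, and count(id) states the intent most directly.

```python
rows = [{'id': 1}, {'id': 2}, {'id': None}]  # rows of a hypothetical result set

sum_1 = sum(1 for r in rows)                            # sum(1): one per row, like count(*)
count_star = len(rows)                                  # count(*): every row
count_id = sum(1 for r in rows if r['id'] is not None)  # count(id): skips NULLs

# After an inner join USING (id), id cannot be NULL, so all three coincide.
```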