Improving a SQL query

Tom*_*eif 5 postgresql

I'm looking for ideas on how to speed up a query that is based on just two tables. I've added indexes and run VACUUM.

Tables:

CREATE TABLE vintage_unit
(
  id integer NOT NULL,
  vintage_unit_date date,
  origination_month timestamp with time zone,
  product character varying,
  customer_type character varying,
  balance numeric,
  CONSTRAINT pk_id_loan PRIMARY KEY (id)
);

CREATE INDEX id_vintage_unit_date  ON vintage_unit USING btree (vintage_unit_date);   
CREATE INDEX ind_customer_type ON vintage_unit USING btree (customer_type COLLATE pg_catalog."default");
CREATE INDEX ind_origination_month ON vintage_unit USING btree  (origination_month);    
CREATE INDEX ind_product ON vintage_unit USING btree (product COLLATE pg_catalog."default");


CREATE TABLE performance_event
(
  id integer,
  event_date date
);

CREATE INDEX ind_event_date ON performance_event USING btree (event_date);
CREATE INDEX ind_id ON performance_event USING btree (id);

Both tables have 10,000 rows.
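For anyone who wants to reproduce this outside the SQLFiddle, here is a rough sketch of how test data of roughly that size could be generated; the value distributions and the product/customer labels are made up for illustration, they are not the real data:

-- hypothetical test data only: 10,000 units spread over 2013
INSERT INTO vintage_unit (id, vintage_unit_date, origination_month, product, customer_type, balance)
SELECT g,
       d,
       date_trunc('month', d),
       (array['loan','lease'])[1 + (random())::int],
       (array['retail','corporate'])[1 + (random())::int],
       round((random() * 10000)::numeric, 2)
FROM (
  SELECT g, date '2013-01-01' + (random() * 364)::int AS d
  FROM generate_series(1, 10000) AS g
) s;

-- 10,000 events, each pointing at an existing unit
INSERT INTO performance_event (id, event_date)
SELECT 1 + (random() * 9999)::int,
       date '2013-01-01' + (random() * 420)::int
FROM generate_series(1, 10000);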

Explain plan: http://explain.depesz.com/s/2di

Query: http://sqlfiddle.com/#!1/edece/1

PostgreSQL 9.1.9

Note: the query uses the function time_distance:

CREATE OR REPLACE FUNCTION time_distance(to_date date, from_date date, granularity character varying)
  RETURNS integer AS
$BODY$

declare
  distance  int;

begin

   -- number of whole months between the two dates, ignoring the day of month
   if granularity = 'month' then
         distance := (extract(months from age(date_trunc('month',to_date), date_trunc('month',from_date))))::int
                     + 12*(extract(years from age(date_trunc('month',to_date), date_trunc('month',from_date))))::int;
   -- number of whole quarters: month distance between quarter starts, divided by 3
   elsif granularity = 'quarter' then
       distance := trunc(((extract(months from age(date_trunc('quarter',to_date), date_trunc('quarter',from_date))))
                     + 12*(extract(years from age(date_trunc('quarter',to_date), date_trunc('quarter',from_date)))))/3)::integer;
   -- number of whole years
   elsif granularity = 'year' then
       distance := ((trunc(extract(years from age(date_trunc('month',to_date), date_trunc('month',from_date)))) ))::integer;
   -- any other granularity is unsupported
   else
       distance:= -1;
   end if;

    return distance;

end;
$BODY$
  LANGUAGE plpgsql IMMUTABLE
  COST 100;
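A quick sanity check of the function above (the dates are arbitrary examples):

SELECT time_distance(date '2013-12-01', date '2013-10-15', 'month');   -- 2: two month boundaries apart
SELECT time_distance(date '2013-12-01', date '2013-10-15', 'quarter'); -- 0: both dates fall in Q4 2013
SELECT time_distance(date '2013-12-01', date '2013-10-15', 'week');    -- -1: unsupported granularity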

The query:

  with vintage_descriptors as (
    select
      origination_month,product,customer_type,balance ,
      vintage_unit_date, 
      generate_series(date_trunc('month',vintage_unit_date),max_vintage_date,'1 month')::date as event_date,
      time_distance(generate_series(date_trunc('month',vintage_unit_date),max_vintage_date,'1 month')::date, vintage_unit_date,'month') as distance
    from(
      select 
        origination_month,product,customer_type,balance ,
        vintage_unit_date, max_vintage_date
      from 
        vintage_unit, 
        (select max(event_date) as max_vintage_date from performance_event) x
      group by 
        origination_month,product,customer_type,balance ,
        vintage_unit_date,
        max_vintage_date
    ) a
  )

  ,vintage_unit_sums as(
      select
        origination_month,product,customer_type,balance ,
        vintage_unit_date, 
        sum(1::int) as vintage_unit_weight,
        count(*) as vintage_unit_count
      from
        vintage_unit
      group by 
        origination_month,product,customer_type,balance ,
        vintage_unit_date
  )

  ,performance_event_sums as(
      select
        origination_month,product,customer_type,balance ,
        vintage_unit_date, 
        date_trunc('month',event_date)::date as event_date, 
        sum(1::int) as event_weight
      from
        vintage_unit
        join performance_event ve using (id)
      group by 
        origination_month,product,customer_type,balance , 
        vintage_unit_date, 
        date_trunc('month',event_date)::date
  )

  ,vintage_csums as (
      select 
        vd.*,
        vs.event_weight,
        sum(coalesce(event_weight,0)) 
          over(partition by origination_month,product,customer_type,balance ,vintage_unit_date order by event_date) as event_weight_csum
      from 
        vintage_descriptors vd
        left join performance_event_sums vs using (origination_month,product,customer_type,balance ,vintage_unit_date,event_date)
  )

  ,aggregation as (
     select
       origination_month,product,customer_type,balance ,
        vd.distance,
        sum(vintage_unit_weight) vintage_unit_weight,
        sum(vintage_unit_count) vintage_unit_count,
        sum(coalesce(event_weight,0)) as event_weight,
        sum(coalesce(event_weight,0)) / sum(vintage_unit_weight) as event_weight_pct,
        sum(coalesce(event_weight_csum,0)) as event_weight_csum,
        sum(coalesce(event_weight_csum,0))/sum(coalesce(vintage_unit_weight,0)) as event_weight_csum_pct
     from
        vintage_descriptors  vd
        join vintage_unit_sums using  (origination_month,product,customer_type,balance , vintage_unit_date)
        left join vintage_csums using (origination_month,product,customer_type,balance , vintage_unit_date, event_date)
     group by
        origination_month,product,customer_type,balance ,
        vd.distance
     order by
        origination_month,product,customer_type,balance ,
        distance
  ) 

  select * , row_number() over(partition by origination_month,product,customer_type , distance order by balance) as rn from aggregation 

Query explanation: the purpose of this query is to compute data for vintage analysis. It typically works with input data structured like this:

[image: generic input data structure]

The query I'm using here is actually built dynamically from user input.

Process:

  • Each combination of origination_month, product, customer_type and balance is a group I want to observe (the group statistics are computed in the CTE vintage_unit_sums).
  • Each group has a number of events (aggregated in performance_event_sums).
  • The goal is to compute, for each number of months after vintage_unit_date (expressed as the distance between vintage_unit_date and event_date, computed with the time_distance function), the ratio between the number of events at that distance and the number of units (rows) in the group, but counting only events that could already have happened; the last available data point is taken to be the maximum vintage_unit_date of the whole dataset. (Example: if vintage_unit_date for a given group is 2013-10-01 and the maximum is 2013-11-01, the maximum available distance is 1. However, when the same group also contains rows with vintage_unit_date 2013-09-01, their maximum distance is 2. When computing the ratio for distance 2, I don't want to include the rows whose maximum available distance is less than 2; see the sketch after this list.)
  • The CTE vintage_descriptors generates a placeholder row for every possible combination of group and potential distance (since not every group has an event at every distance).
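To make the censoring rule from the list above concrete, here is its example expressed with time_distance (same dates as in the description):

-- with a maximum available date of 2013-11-01, a unit from 2013-10-01 can only be
-- observed up to distance 1, while a unit from 2013-09-01 reaches distance 2
SELECT time_distance(date '2013-11-01', date '2013-10-01', 'month') AS max_distance_oct,  -- 1
       time_distance(date '2013-11-01', date '2013-09-01', 'month') AS max_distance_sep;  -- 2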

About a year ago I thought this could be solved with OLAP and asked a question about it on StackOverflow. Since then I've built my own solution in R, where the SQL is generated dynamically from user input and sent to PostgreSQL.

dez*_*zso 1

It looks like raising work_mem by a few MB (to 30, for example) speeds this up considerably. You could also remove the ORDER BY clause from the aggregation CTE. Similarly, the GROUP BY in the subquery of the first CTE looks useless. Also, sum(1) is basically a count of a non-null column, for example the one used in the inner join. Furthermore, it looks like you carry a bunch of non-aggregated, non-grouped columns through the CTEs; it is usually a good idea to join those in at the latest possible step, so you reduce the memory consumption of the intermediate steps. I mean, for example, the vd.* in vintage_csums.
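A minimal sketch of what those suggestions could look like (the 30 MB value is just the example mentioned above, and only one CTE is shown):

-- session-level setting; adjust to taste
SET work_mem = '30MB';

-- vintage_unit_sums rewritten with count(*) instead of sum(1::int)
with vintage_unit_sums as (
  select
    origination_month, product, customer_type, balance,
    vintage_unit_date,
    count(*) as vintage_unit_weight,   -- same result as sum(1::int)
    count(*) as vintage_unit_count
  from vintage_unit
  group by
    origination_month, product, customer_type, balance,
    vintage_unit_date
)
select * from vintage_unit_sums;

-- likewise, in performance_event_sums a count(id) works because id cannot be null on the
-- inner-join side, and the ORDER BY inside the "aggregation" CTE can simply be dropped.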