Optimizing a LATERAL JOIN query on a large table

max*_*ire 9 postgresql performance optimization greatest-n-per-group postgresql-performance

I am using Postgres 9.5. I have a table that records page hits from several websites. The table contains about 32 million rows spanning from Jan 1, 2016 to June 30, 2016.

CREATE TABLE event_pg (
   timestamp_        timestamp without time zone NOT NULL,
   person_id         character(24),
   location_host     varchar(256),
   location_path     varchar(256),
   location_query    varchar(256),
   location_fragment varchar(256)
);

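I am trying to tune a query that counts the number of people who performed a given sequence of page hits. The query is meant to answer questions like "how many people viewed the home page, then went to the help site, and then viewed the thank-you page?" The result looks like this: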

 home-page | help site | thankyou
-----------+-----------+----------
 10000     | 9800      | 1500

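Notice that the numbers decrease, which makes sense: of the 10000 people who viewed the home page, 9800 went on to the help site, and of those, 1500 went on to hit the thank-you page.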

The SQL for a 3-step sequence uses lateral joins, as follows:

SELECT 
  sum(view_homepage) AS view_homepage,
  sum(use_help) AS use_help,
  sum(thank_you) AS thank_you
FROM (
  -- Get the first time each user viewed the homepage.
  SELECT X.person_id,
    1 AS view_homepage,
    min(timestamp_) AS view_homepage_time
  FROM event_pg X 
  WHERE X.timestamp_ between '2016-04-23 00:00:00.0' and timestamp '2016-04-30 23:59:59.999'
  AND X.location_host like '2015.testonline.ca'
  GROUP BY X.person_id
) e1 
LEFT JOIN LATERAL (
  SELECT
    Y.person_id,
    1 AS use_help,
    timestamp_ AS use_help_time
  FROM event_pg Y 
  WHERE 
    Y.person_id = e1.person_id AND
    location_host = 'helpcentre.testonline.ca' AND
    timestamp_ BETWEEN view_homepage_time AND timestamp '2016-04-30 23:59:59.999'
  ORDER BY timestamp_
  LIMIT 1
) e2 ON true 
LEFT JOIN LATERAL (
  SELECT
    1 AS thank_you,
    timestamp_ AS thank_you_time
  FROM event_pg Z 
  WHERE Z.person_id = e2.person_id AND
    location_fragment =  '/file/thank-you' AND
    timestamp_ BETWEEN use_help_time AND timestamp '2016-04-30 23:59:59.999'
  ORDER BY timestamp_
  LIMIT 1
) e3 ON true;

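I have an index on the timestamp_, person_id and location columns. Queries for date ranges of days or weeks are very fast (1 to 10 seconds). It gets slow when I run the query for everything between Jan 1 and July 30: that takes over a minute. Comparing the EXPLAIN output for the two cases shows that it no longer uses the timestamp_ index and does a Seq Scan instead, because the index buys us nothing when we query for "all time" and therefore touch almost every row in the table.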

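I realize that the nested-loop nature of the lateral joins gets slower the more records it has to loop over, but is there any way to speed up this query for huge date ranges so that it scales better?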

Erw*_*ter 11

Preliminaries

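  • You are using odd data types. character(24)? char(n) is an outdated type and almost always the wrong choice. You have indexes on person_id and join on it repeatedly. integer would be much more efficient for several reasons. (Or bigint, if you plan to burn more than 2 billion rows over the lifetime of the table.)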

  • LIKE without wildcards is pointless. Use = instead, which is faster. Replace the first expression with the second:
    x.location_host LIKE '2015.testonline.ca'
    x.location_host = '2015.testonline.ca'

  • Use count(e1.*) or count(*) instead of adding a dummy column with the value 1 to each subquery. (Except for the last one (e3), where you don't need any actual data.)

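  • You sometimes cast the string literal to timestamp (timestamp '2016-04-30 23:59:59.999') and sometimes not ('2016-04-23 00:00:00.0'). Either it makes sense, then do it all the time, or it doesn't, then don't do it.
    It doesn't. When compared to a timestamp column, a string literal is coerced to timestamp anyway, so you don't need an explicit cast.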

  • The Postgres data type timestamp has up to 6 fractional digits. Your BETWEEN expressions leave corner cases. I replaced them with less error-prone expressions.
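
To illustrate the corner case in the last point, here is a minimal sketch. The sample value is made up for the illustration; only the two range expressions mirror the ones discussed here.

```sql
-- Hypothetical sample value: a hit recorded half a millisecond before midnight.
SELECT ts
     , (ts BETWEEN '2016-04-23 00:00:00.0'
               AND timestamp '2016-04-30 23:59:59.999') AS in_between    -- false: the row is missed
     , (ts >= '2016-04-23' AND ts < '2016-05-01')       AS in_half_open  -- true: the row is covered
FROM  (VALUES (timestamp '2016-04-30 23:59:59.9995')) AS t(ts);
```

A half-open range (>= lower bound, < upper bound) has no such gap, whatever the fractional precision of the stored value.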

Indexes

Important: for optimal performance, create multicolumn indexes.
For the first subquery hp:

CREATE INDEX event_pg_location_host_timestamp__idx
ON event_pg (location_host, timestamp_);

Or, if you can get index-only scans out of it, append person_id to the index:

CREATE INDEX event_pg_location_host_timestamp__person_id_idx
ON event_pg (location_host, timestamp_, person_id);

For very large time ranges spanning most or all of your table, this index should be preferable. It also supports the hlp subquery, so create it either way:

CREATE INDEX event_pg_location_host_person_id_timestamp__idx
ON event_pg (location_host, person_id, timestamp_);

For tnk:

CREATE INDEX event_pg_location_fragment_timestamp__idx
ON event_pg (location_fragment, person_id, timestamp_);

Optimize with partial indexes

If your predicates on location_host and location_fragment are constant, we can use much cheaper partial indexes instead, especially since your location_* columns seem big:

CREATE INDEX event_pg_hp_person_id_ts_idx ON event_pg (person_id, timestamp_)
WHERE  location_host = '2015.testonline.ca';

CREATE INDEX event_pg_hlp_person_id_ts_idx ON event_pg (person_id, timestamp_)
WHERE  location_host = 'helpcentre.testonline.ca';

CREATE INDEX event_pg_tnk_person_id_ts_idx ON event_pg (person_id, timestamp_)
WHERE  location_fragment = '/file/thank-you';

Again, all of these indexes would be substantially smaller and faster with integer or bigint for person_id.

Typically, you need to ANALYZE the table after creating a new index, or wait until autovacuum does that for you.

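To get index-only scans, your table must be VACUUM'ed enough. Test immediately after a VACUUM as proof of concept. If you are not familiar with index-only scans, read the Postgres Wiki page on them for details.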
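
As a rough sketch of that proof of concept, assuming the three-column index on (location_host, timestamp_, person_id) from the Indexes section exists. The date range is only an example, and whether the planner actually picks the index depends on your data:

```sql
-- Refresh the visibility map and the planner statistics in one pass.
VACUUM (ANALYZE) event_pg;

-- Then look at the plan of the first subquery on its own.
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT ON (person_id)
       person_id, timestamp_ AS hp_ts
FROM   event_pg
WHERE  timestamp_ >= '2016-04-23'
AND    timestamp_ <  '2016-05-01'
AND    location_host = '2015.testonline.ca'
ORDER  BY person_id, timestamp_;
```

In the output, an "Index Only Scan" node with a low "Heap Fetches" count suggests the index covers the query. A high "Heap Fetches" count suggests the visibility map is not up to date yet.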

Basic query

Implementing what I discussed. Query for small ranges (few rows per person_id):

SELECT count(*)::int           AS view_homepage
     , count(hlp.hlp_ts)::int AS use_help
     , count(tnk.yes)::int     AS thank_you
FROM  (
   SELECT DISTINCT ON (person_id)
          person_id, timestamp_ AS hp_ts
   FROM   event_pg
   WHERE  timestamp_ >= '2016-04-23'
   AND    timestamp_ <  '2016-05-01'
   AND    location_host = '2015.testonline.ca'
   ORDER  BY person_id, timestamp_
   ) hp
LEFT JOIN LATERAL (
   SELECT timestamp_ AS hlp_ts
   FROM   event_pg y 
   WHERE  y.person_id = hp.person_id
   AND    timestamp_ >= hp.hp_ts
   AND    timestamp_ <  '2016-05-01'
   AND    location_host = 'helpcentre.testonline.ca'
   ORDER  BY timestamp_
   LIMIT  1
   ) hlp ON true 
LEFT JOIN LATERAL (
   SELECT true AS yes                   -- we only need existence
   FROM   event_pg z
   WHERE  z.person_id = hp.person_id    -- we can use hp here
   AND    location_fragment = '/file/thank-you'
   AND    timestamp_ >= hlp.hlp_ts      -- this introduces dependency on hlp anyways.
   AND    timestamp_ <  '2016-05-01'
   ORDER  BY timestamp_
   LIMIT  1
   ) tnk ON true;
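
The outer SELECT relies on count(*) counting every row, while count(expression) only counts rows where the expression is not NULL, which is what a LEFT JOIN LATERAL without a match produces. A minimal sketch with made-up values:

```sql
-- Three hypothetical hp rows, one of which found no matching help-site hit.
SELECT count(*)        AS all_rows       -- 3
     , count(t.hlp_ts) AS non_null_rows  -- 2
FROM  (VALUES (timestamp '2016-04-23 10:00')
            , (NULL)
            , (timestamp '2016-04-24 11:00')) AS t(hlp_ts);
```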

DISTINCT ON is typically cheaper for few rows per person_id.

If you have many rows per person_id (more likely for bigger time ranges), a recursive CTE (rCTE) can be (much) faster.

See it integrated below.

Optimize and automate the best query

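It's the old conundrum: one query technique works best for smaller sets, another one for bigger sets. In your particular case we have a very good indicator from the start, the length of the given time period, which we can use to decide.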
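
One way to pick a threshold is to measure how many qualifying rows there are per person_id for a few candidate ranges. This is a rough sketch and an assumption on my part rather than something derived from your data; the date range is an example:

```sql
SELECT count(*)                  AS row_ct
     , count(DISTINCT person_id) AS person_ct
     , round(count(*)::numeric
           / NULLIF(count(DISTINCT person_id), 0), 1) AS rows_per_person
FROM   event_pg
WHERE  location_host = '2015.testonline.ca'
AND    timestamp_ >= '2016-01-01'
AND    timestamp_ <  '2016-07-01';
```

The higher rows_per_person gets, the more likely the rCTE is to pay off. The actual break-even point still has to be found by timing both variants.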

We wrap it all in a PL/pgSQL function. My implementation switches from DISTINCT ON to the rCTE when the given time period is longer than a set threshold:

CREATE OR REPLACE FUNCTION f_my_counts(_ts_low_inc timestamp, _ts_hi_excl timestamp)
  RETURNS TABLE (view_homepage int, use_help int, thank_you int) AS
$func$
BEGIN

CASE
WHEN _ts_hi_excl <= _ts_low_inc THEN
   RAISE EXCEPTION 'Timestamp _ts_hi_excl (2nd param) must be later than _ts_low_inc!';

WHEN _ts_hi_excl - _ts_low_inc < interval '10 days' THEN  -- example value !!!
-- DISTINCT ON for few rows per person_id
   RETURN QUERY
   WITH hp AS (
      SELECT DISTINCT ON (person_id)
             person_id, timestamp_ AS hp_ts
      FROM   event_pg
      WHERE  timestamp_ >= _ts_low_inc
      AND    timestamp_ <  _ts_hi_excl
      AND    location_host = '2015.testonline.ca'
      ORDER  BY person_id, timestamp_
      )
    , hlp AS (
      SELECT hp.person_id, hlp.hlp_ts
      FROM   hp
      CROSS  JOIN LATERAL (
         SELECT timestamp_ AS hlp_ts
         FROM   event_pg
         WHERE  person_id = hp.person_id
         AND    timestamp_ >= hp.hp_ts
         AND    timestamp_ < _ts_hi_excl
         AND    location_host = 'helpcentre.testonline.ca'  -- match partial idx
         ORDER  BY timestamp_
         LIMIT  1
         ) hlp
      )
   SELECT (SELECT count(*)::int FROM hp)   -- AS view_homepage
        , (SELECT count(*)::int FROM hlp)  -- AS use_help
        , (SELECT count(*)::int            -- AS thank_you
           FROM   hlp
           CROSS  JOIN LATERAL (
              SELECT 1                     -- we only care for existence
              FROM   event_pg
              WHERE  person_id = hlp.person_id
              AND    location_fragment = '/file/thank-you'
              AND    timestamp_ >= hlp.hlp_ts
              AND    timestamp_ < _ts_hi_excl
              ORDER  BY timestamp_
              LIMIT  1
              ) tnk
           );

ELSE
-- rCTE for many rows per person_id
   RETURN QUERY
   WITH RECURSIVE hp AS (
      (  -- parentheses required
      SELECT person_id, timestamp_ AS hp_ts
      FROM   event_pg
      WHERE  timestamp_ >= _ts_low_inc
      AND    timestamp_ <  _ts_hi_excl
      AND    location_host = '2015.testonline.ca'  -- match partial idx
      ORDER  BY person_id, timestamp_
      LIMIT  1
      )
      UNION ALL
      SELECT x.*
      FROM   hp, LATERAL (
         SELECT person_id, timestamp_ AS hp_ts
         FROM   event_pg
         WHERE  person_id  > hp.person_id  -- lateral reference
         AND    timestamp_ >= _ts_low_inc  -- repeat conditions
         AND    timestamp_ <  _ts_hi_excl
         AND    location_host = '2015.testonline.ca'  -- match partial idx
         ORDER  BY person_id, timestamp_
         LIMIT  1
         ) x
      )
    , hlp AS (
      SELECT hp.person_id, hlp.hlp_ts
      FROM   hp
      CROSS  JOIN LATERAL (
         SELECT timestamp_ AS hlp_ts
         FROM   event_pg y 
         WHERE  y.person_id = hp.person_id
         AND    location_host = 'helpcentre.testonline.ca'  -- match partial idx
         AND    timestamp_ >= hp.hp_ts
         AND    timestamp_ < _ts_hi_excl
         ORDER  BY timestamp_
         LIMIT  1
         ) hlp
      )
   SELECT (SELECT count(*)::int FROM hp)   -- AS view_homepage
        , (SELECT count(*)::int FROM hlp)  -- AS use_help
        , (SELECT count(*)::int            -- AS thank_you
           FROM   hlp
           CROSS  JOIN LATERAL (
              SELECT 1                     -- we only care for existence
              FROM   event_pg
              WHERE  person_id = hlp.person_id
              AND    location_fragment = '/file/thank-you'
              AND    timestamp_ >= hlp.hlp_ts
              AND    timestamp_ < _ts_hi_excl
              ORDER  BY timestamp_
              LIMIT  1
              ) tnk
           );
END CASE;

END
$func$  LANGUAGE plpgsql STABLE STRICT;

Call:

SELECT * FROM f_my_counts('2016-01-23', '2016-05-01');

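An rCTE works with a CTE by definition. I also added CTEs to the DISTINCT ON query (as discussed with @Lennart in the comments), which allows us to use CROSS JOIN instead of LEFT JOIN to reduce the set with each step, since we can count each CTE separately. This has effects in opposite directions: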

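  • On the one hand, we reduce the number of rows, which should make the third join cheaper.
  • On the other hand, we introduce overhead for the CTEs and need more RAM, which may matter especially for big queries like yours.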

You'll have to test which outweighs the other.
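
A simple way to run that test, sketched with example ranges on either side of the 10-day example threshold in f_my_counts:

```sql
-- Short range (8 days): takes the DISTINCT ON branch.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM f_my_counts('2016-04-23', '2016-05-01');

-- Long range (6 months): takes the rCTE branch.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM f_my_counts('2016-01-01', '2016-07-01');
```

EXPLAIN only shows a Function Scan node here, so this gives total execution time but not the plans of the queries inside the function. To see those, either run the inner queries directly with literal timestamps, or log them with the auto_explain module and auto_explain.log_nested_statements enabled.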

