postgres_fdw performance is slow

J-D*_*awG 13 postgresql performance postgresql-fdw postgresql-9.5 query-performance

The following query against a foreign table takes about 5 seconds to execute on 3.2 million rows:

SELECT x."IncidentTypeCode", COUNT(x."IncidentTypeCode") 
FROM "IntterraNearRealTimeUnitReflexes300sForeign" x 
WHERE x."IncidentDateTime" >= '05/01/2016' 
GROUP BY x."IncidentTypeCode" 
ORDER BY 1;

When I execute the same query on a normal table, it returns in 0.6 seconds. The execution plans are completely different:

Normal table:

Sort  (cost=226861.20..226861.21 rows=4 width=4) (actual time=646.447..646.448 rows=7 loops=1) 
  Sort Key: "IncidentTypeCode" 
  Sort Method: quicksort  Memory: 25kB 
  -> HashAggregate (cost=226861.12..226861.16 rows=4 width=4) (actual  time=646.433..646.434 rows=7 loops=1)
     Group Key: "IncidentTypeCode"
     -> Bitmap Heap Scan on "IntterraNearRealTimeUnitReflexes300s" x  (cost=10597.63..223318.41 rows=708542 width=4) (actual time=74.593..342.110 rows=709376 loops=1) 
        Recheck Cond: ("IncidentDateTime" >= '2016-05-01 00:00:00'::timestamp without time zone) 
        Rows Removed by Index Recheck: 12259 
        Heap Blocks: exact=27052 lossy=26888
        -> Bitmap Index Scan on idx_incident_date_time_300  (cost=0.00..10420.49 rows=708542 width=0) (actual time=69.722..69.722 rows=709376 loops=1) 
           Index Cond: ("IncidentDateTime" >= '2016-05-01 00:00:00'::timestamp without time zone) 

Planning time: 0.165 ms 
Execution time: 646.512 ms

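I think I am paying a high price for the GROUP BY clause. When I run EXPLAIN VERBOSE, the query sent to the foreign server does not include it: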

SELECT
    "IncidentTypeCode"
FROM
    PUBLIC ."IntterraNearRealTimeUnitReflexes300s"
WHERE
    (
        (
            "IncidentDateTime" >= '2016-05-01 00:00:00' :: TIMESTAMP WITHOUT TIME ZONE
        )
    )

This returns 700k rows. Is there a way around this?

I spent a lot of time yesterday reading this documentation page and thought I had found the answer in setting use_remote_estimate to true, but it had no effect.

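I do have access to the foreign server to create objects if necessary. The timestamp value in the WHERE clause can be anything; it does not come from a list of predefined values.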

3ma*_*uek 8

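If you use use_remote_estimate, be sure to run ANALYZE on the foreign table (the estimates are very close to the actual row counts, so you have probably already done that). Also, the pushdown improvements are not available in versions before 9.5. I am also assuming that the table structure on the remote server is the same, including indexes. If a bitmap scan is needed because of low cardinality, the index will not be used, due to limitations of the pushdown mechanism. You may want to reduce the number of rows returned (narrower timestamp ranges) to force a BTREE index scan. Unfortunately, there is no clean way to avoid a SeqScan on the remote server if the filter returns more than about 10% of the rows in the table (this percentage can vary, depending on whether the planner thinks scanning the whole table is cheaper than seek reads). If you are using SSDs, you may find it useful to tune random_page_cost.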
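As a rough sketch of the settings mentioned above: the server name remote_srv and the role fdw_user are hypothetical placeholders, the foreign table name is taken from the question, and 1.1 is only an illustrative value, not a recommendation.

```sql
-- On the local server. "remote_srv" is a hypothetical foreign server name.
-- Use SET instead of ADD if the option has already been defined on the server.
ALTER SERVER remote_srv OPTIONS (ADD use_remote_estimate 'true');

-- Collect statistics for the foreign table; they are stored locally.
ANALYZE "IntterraNearRealTimeUnitReflexes300sForeign";

-- On the remote server. "fdw_user" is a hypothetical role, assumed to be
-- the one named in the user mapping. The setting applies to new sessions
-- opened by that role, which includes new postgres_fdw connections.
ALTER ROLE fdw_user SET random_page_cost = 1.1;
```

Setting random_page_cost per role rather than with a plain SET is a deliberate choice here: a SET issued in an interactive session on the remote server would not affect the separate connection that postgres_fdw opens.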

You can use a CTE to isolate the GROUP BY behavior:

WITH atable AS (
    SELECT "IncidentTypeCode"
    FROM public."IntterraNearRealTimeUnitReflexes300s"
    WHERE
       ("IncidentDateTime"
              BETWEEN '2016-05-01 00:00:00'::TIMESTAMP WITHOUT TIME ZONE
                  AND '2016-05-02 00:00:00'::TIMESTAMP WITHOUT TIME ZONE)
)
SELECT atable."IncidentTypeCode", COUNT(atable."IncidentTypeCode")
FROM atable
GROUP BY atable."IncidentTypeCode"
ORDER BY atable."IncidentTypeCode";
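
To check what is actually shipped to the remote server for a query like this, one option is to wrap it in EXPLAIN (ANALYZE, VERBOSE) and read the "Remote SQL" line of the Foreign Scan node. The sketch below assumes the CTE is run against the foreign table named in the question; the one-day range is only an example.

```sql
-- Assumption: the query targets the foreign table from the question.
-- With VERBOSE, the Foreign Scan node prints the statement sent to the
-- remote server as "Remote SQL: ...".
EXPLAIN (ANALYZE, VERBOSE)
WITH atable AS (
    SELECT "IncidentTypeCode"
    FROM "IntterraNearRealTimeUnitReflexes300sForeign"
    WHERE "IncidentDateTime" >= '2016-05-01 00:00:00'::timestamp
      AND "IncidentDateTime" <  '2016-05-02 00:00:00'::timestamp
)
SELECT atable."IncidentTypeCode", COUNT(atable."IncidentTypeCode")
FROM atable
GROUP BY atable."IncidentTypeCode"
ORDER BY atable."IncidentTypeCode";
```

If the Remote SQL contains the WHERE condition but no GROUP BY, the filter is applied remotely and the aggregation runs locally on the rows that come back, so the row count on the Foreign Scan node shows how much data crosses the network.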