Why is Postgres sitting 95% idle, with no file I/O?

Question

Ste*_*ett 8 postgresql performance bottleneck diagnostic postgis

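I have a TileMill/PostGIS stack running on an 8-core Ubuntu 12.04 VM on an OpenStack cloud. It is a rebuild of a very similar system that ran fine last week on very similar hardware (same cloud, but different physical hardware, I believe). I tried to rebuild the stack exactly as it was, using some scripts I had built.

Everything runs, but the database executes queries excruciatingly slowly, which ultimately shows up as very slow tile generation. One example query (counting the number of pubs within a radius of every town in Australia) used to take 10-20 seconds and now takes over 10 minutes: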

explain (analyze, buffers) update places set pubs = 
(select count(*) from planet_osm_point p where p.amenity = 'pub' and st_dwithin(p.way,places.way,scope)) +
(select count(*) from planet_osm_polygon p where p.amenity = 'pub' and st_dwithin(p.way,places.way,scope)) ;
 Update on places  (cost=0.00..948254806.93 rows=9037 width=160) (actual time=623321.558..623321.558 rows=0 loops=1)
   Buffers: shared hit=132126300
   ->  Seq Scan on places  (cost=0.00..948254806.93 rows=9037 width=160) (actual time=68.130..622931.130 rows=9037 loops=1)
         Buffers: shared hit=132107781
         SubPlan 1
           ->  Aggregate  (cost=12.95..12.96 rows=1 width=0) (actual time=0.187..0.188 rows=1 loops=9037)
                 Buffers: shared hit=158171
                 ->  Index Scan using planet_osm_point_index on planet_osm_point p  (cost=0.00..12.94 rows=1 width=0) (actual time=0.163..0.179 rows=0 loops=9037)
                       Index Cond: (way && st_expand(places.way, (places.scope)::double precision))
                       Filter: ((amenity = 'pub'::text) AND (places.way && st_expand(way, (places.scope)::double precision)) AND _st_dwithin(way, places.way, (places.scope)::double precision))
                       Buffers: shared hit=158171
         SubPlan 2
           ->  Aggregate  (cost=104917.24..104917.25 rows=1 width=0) (actual time=68.727..68.728 rows=1 loops=9037)
                 Buffers: shared hit=131949237
                 ->  Seq Scan on planet_osm_polygon p  (cost=0.00..104917.24 rows=1 width=0) (actual time=68.138..68.716 rows=0 loops=9037)
                       Filter: ((amenity = 'pub'::text) AND (way && st_expand(places.way, (places.scope)::double precision)) AND (places.way && st_expand(way, (places.scope)::double precision)) AND _st_dwithin(way, places.way, (places.scope)::double precision))
                       Buffers: shared hit=131949237
 Total runtime: 623321.801 ms

(I'm including this query as a symptom, not as the problem to be solved directly. This particular query only runs about once a week.)

The server has 32 GB of RAM, and I have configured Postgres as follows (following advice found on the web):

shared_buffers = 8GB
autovacuum = on
effective_cache_size = 8GB
work_mem = 128MB
maintenance_work_mem = 64MB
wal_buffers = 1MB
checkpoint_segments = 10

iostat shows no data being read, a little being written (no idea where or why), and 95% idle CPU:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.40    0.00    0.00    0.11    0.00   94.49

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda               0.20         0.00         0.80          0          8
vdb               2.30         0.00        17.58          0        176

Sample output from vmstat:

  procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
...
 1  0      0 18329748 126108 12600436    0    0     0    18  148  140  5  0 95  0
 2  0      0 18329400 126124 12600436    0    0     0     9  173  228  5  0 95  0

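In desperation, I moved the Postgres data directory from vda to vdb, but of course that made no difference.

So I'm at a loss. Why is Postgres using only 5% of the available CPU when it isn't waiting on any I/O? I'd welcome any suggestions for further investigation, other tools, or random things to try.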

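Update

I snapshotted the server and launched it in a different part of the same cloud (a different availability zone). The results were a bit strange. vmstat on this server reports 12% CPU usage, which I now understand to be the expected value for a single Postgres query on an 8-core VM (one fully busy core out of eight is 12.5%). Yet the actual query execution time is virtually identical (630 seconds vs. 623 seconds).

I now realise that this particular query is probably not a good sample for this reason: it can only use one core, and it is an update (whereas tile rendering is just selects).

I also hadn't noticed that the explain output shows planet_osm_polygon is not using an index. That could well be the cause, so I'll chase that next.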
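One way to chase this is to list which indexes actually exist on the two tables. A minimal sketch, assuming the table names shown in the plan above; pg_indexes is a standard Postgres system view:

```sql
-- List every index defined on the two OSM tables, with its full definition.
-- A spatial index would appear as "USING gist (way)" in indexdef.
SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE tablename IN ('planet_osm_polygon', 'planet_osm_point')
ORDER BY tablename, indexname;
```

If no row for planet_osm_polygon mentions the way column, the && bounding-box test in the plan has no index to use and the planner falls back to a sequential scan.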

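Update 2

The problem definitely seems to be whether the planet_osm_polygon indexes are being used. There are two (one created by osm2pgsql, one created by me following some random guide):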

CREATE INDEX idx_planet_osm_polygon_tags
  ON planet_osm_polygon
  USING gist
  (tags);


CREATE INDEX planet_osm_polygon_pkey
  ON planet_osm_polygon
  USING btree
  (osm_id);

The stats on planet_osm_polygon and planet_osm_point are pretty revealing, I think:

planet_osm_polygon:

Sequential Scans    194204  
Sequential Tuples Read  60981018608 
Index Scans 1574    
Index Tuples Fetched    0

planet_osm_point:

Sequential Scans    1142    
Sequential Tuples Read  12960604    
Index Scans 183454  
Index Tuples Fetched    43427685

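If I read that right, Postgres has searched planet_osm_polygon 1574 times using an index but never actually found anything, and so has done a ridiculously large number of brute-force searches.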
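For reference, counters like these can be read directly from the statistics views. A sketch, assuming the figures above correspond to the standard pg_stat_user_tables columns (that mapping is an assumption; the post does not say which tool displayed them):

```sql
-- Per-table access counters since the statistics were last reset.
SELECT relname,
       seq_scan,       -- number of sequential scans started
       seq_tup_read,   -- rows read by those sequential scans
       idx_scan,       -- number of index scans started
       idx_tup_fetch   -- rows fetched by index scans
FROM pg_stat_user_tables
WHERE relname IN ('planet_osm_polygon', 'planet_osm_point');
```

A very large seq_tup_read next to an idx_tup_fetch of 0 is the pattern reported for planet_osm_polygon.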

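The new question: why?

Mystery solved

Thanks to Frederik Ramm's answer, the explanation turns out to be fairly simple: for some reason, there were no spatial indexes. Regenerating them was trivial: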

create index planet_osm_polygon_polygon on planet_osm_polygon using gist(way);
create index planet_osm_polygon_point on planet_osm_point using gist(way);

Running that query now takes 4.6 seconds. Spatial indexes matter! :)
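To confirm that the planner picks up the new indexes, a plan-only check along these lines may help. It is a sketch rather than the original workflow: the select below is a hypothetical read-only stand-in for the polygon subquery of the update, reusing the places.scope column seen in the plan.

```sql
-- Refresh planner statistics after building the new indexes.
ANALYZE planet_osm_polygon;
ANALYZE planet_osm_point;

-- Plain EXPLAIN only plans the query; nothing is executed or modified.
EXPLAIN
SELECT count(*)
FROM places
JOIN planet_osm_polygon p
  ON p.amenity = 'pub'
 AND ST_DWithin(p.way, places.way, places.scope);
```

If the index is usable, the plan should show an index scan (or bitmap index scan) on planet_osm_polygon instead of the Seq Scan seen earlier.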

Answer 1

Mar*_*erg 4

Running the Explain Analyze output through explain.depesz.com highlights that the bulk of the slowness comes from this operation:

Seq Scan on planet_osm_polygon p

Was it indexed before? Can you index it now?
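The same conclusion can be checked by hand from the plan in the question: each subplan's per-loop time multiplied by its loop count gives its share of the runtime. A quick sketch using the figures reported there (68.728 ms and 0.188 ms per loop, 9037 loops each):

```sql
-- Per-loop actual time (ms) times loops, converted to seconds.
SELECT round(68.728 * 9037 / 1000.0, 1) AS polygon_seq_scan_seconds,
       round(0.188  * 9037 / 1000.0, 1) AS point_index_scan_seconds;
```

That works out to roughly 621 seconds for the sequential scan on planet_osm_polygon against under 2 seconds for the index scan on planet_osm_point, out of a total runtime of about 623 seconds.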

While searching on that problem area, I also found a related Q&A on the OpenStreetMap site:

Local tile server - extremely slow rendering