Tags: postgresql, count, postgresql-12, query-performance
I am trying to optimize a table with more than 80 million rows. Getting the row count takes more than 20 minutes. I have tried CLUSTER, VACUUM FULL and REINDEX, but performance has not improved. What do I need to configure or tune to improve querying and retrieving this data? I am running PostgreSQL 12 on Windows 2019.
Update:
Explain result for 'select count(*) from doc_details':

Finalize Aggregate  (cost=5554120.84..5554120.85 rows=1 width=8) (actual time=1249204.001..1249210.027 rows=1 loops=1)
  ->  Gather  (cost=5554120.63..5554120.83 rows=2 width=8) (actual time=1249203.642..1249210.020 rows=3 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Partial Aggregate  (cost=5553120.63..5553120.63 rows=1 width=8) (actual time=1249153.615..1249153.616 rows=1 loops=3)
              ->  Parallel Seq Scan on doc_details  (cost=0.00..5456055.30 rows=38826130 width=0) (actual time=3.793..1245165.604 rows=31018949 loops=3)
Planning Time: 1.290 ms
Execution Time: 1249210.115 ms
(I don't know how to get the row size in KB/MB.)
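Table size and an estimated average row width can be read from the system catalogs without scanning the table. A minimal sketch, assuming the table lives in the public schema and has been analyzed so that pg_class.reltuples is populated:

SELECT pg_size_pretty(pg_total_relation_size('doc_details'))  AS total_size_with_indexes,
       pg_size_pretty(pg_relation_size('doc_details'))        AS heap_size,
       -- estimated average row width: heap size divided by the planner's row estimate
       pg_size_pretty(pg_relation_size('doc_details')
                      / GREATEST(reltuples::bigint, 1))        AS approx_avg_row_size
FROM pg_class
WHERE oid = 'public.doc_details'::regclass;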
Machine info:
Table info:
Table "public.doc_details"
Column | Type | Collation | Nullable | Default
-------------------------+--------------------------------+-----------+----------+----------------------------------------------
id | integer | | not null | nextval('doc_details_id_seq'::regclass)
trans_ref_number | character varying(30) | | not null |
outbound_time | timestamp(0) without time zone | | |
lm_tracking | character varying(30) | | not null |
cargo_dealer_tracking | character varying(30) | | not null |
order_sn | character varying(30) | | |
operations_no | character varying(30) | | |
box_no | character varying(30) | | |
box_size | character varying(30) | | |
parcel_weight_kg | numeric(8,3) | | |
parcel_size | character varying(30) | | |
box_weight_kg | numeric(8,3) | | |
box_volume | integer | | |
parcel_volume | integer | | |
transportation | character varying(100) | | |
channel | character varying(30) | | |
service_code | character varying(20) | | |
country | character varying(60) | | |
destination_code | character varying(20) | | |
assignee_name | character varying(100) | | |
assignee_province_state | character varying(30) | | |
assignee_city | character varying(30) | | |
postal_code | character varying(20) | | |
assignee_telephone | character varying(30) | | |
assignee_address | text | | |
shipper_name | character varying(100) | | |
shipper_country | character varying(60) | | |
shipper_province | character varying(30) | | |
shipper_city | character varying(30) | | |
shipper_address | text | | |
shipper_telephone | character varying(30) | | |
package_qty | integer | | |
hs_code | integer | | |
hs_code_manual | integer | | |
reviewed | boolean | | |
created_at | timestamp(0) without time zone | | |
updated_at | timestamp(0) without time zone | | |
invalid | boolean | | |
arrival_id | integer | | |
excel_row_number | integer | | |
is_additional | boolean | | |
arrival_datetime | timestamp(6) without time zone | | |
invoice_date | timestamp without time zone | | |
unit_code | character varying(100) | | |
Indexes:
"doc_details_pkey" PRIMARY KEY, btree (id) CLUSTER
"doc_details_box_no_idx" btree (box_no)
"doc_details_trans_ref_number_idx" btree (trans_ref_number)
Triggers:
trigger_log_awb_box AFTER INSERT ON doc_details FOR EACH ROW EXECUTE FUNCTION log_awb_box()
From the PostgreSQL wiki:
The reason is related to the MVCC implementation in PostgreSQL. The fact that multiple transactions can see different states of the data means that there can be no straightforward way for COUNT(*) to summarize data across the whole table; PostgreSQL must walk through all rows, in some sense, to determine visibility. This normally results in a sequential scan reading information about every row in the table.
Reference: Slow Counting (PostgreSQL Wiki)
So there is no faster way (for PostgreSQL) to read through the roughly 93 million rows; PostgreSQL will painstakingly read them row by row, as your explain plan shows.
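If an approximate figure is acceptable, the Slow Counting wiki page points to the planner's statistics, which can be read almost instantly. A sketch, assuming autovacuum or a recent ANALYZE has kept reltuples reasonably current:

-- estimated row count from the statistics collected by ANALYZE/autovacuum
SELECT reltuples::bigint AS approx_row_count
FROM pg_class
WHERE oid = 'public.doc_details'::regclass;

The approaches discussed around that wiki page also include keeping an exact count in a separate counter table maintained by triggers, at the cost of extra overhead on every insert and delete.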
Increasing the shared_buffers setting in the postgresql.conf file may help to alleviate the problem somewhat, by allowing PostgreSQL to keep more data in memory.
Sets the amount of memory the database server uses for shared memory buffers. The default is typically 128 megabytes (128MB), but might be less if your kernel settings will not support it (as determined during initdb). This setting must be at least 128 kilobytes. However, settings significantly higher than the minimum are usually needed for good performance. If this value is specified without units, it is taken as blocks, that is BLCKSZ bytes, typically 8kB. (Non-default values of BLCKSZ change the minimum value.) This parameter can only be set at server start.
If you have a dedicated database server with 1GB or more of RAM, a reasonable starting value for shared_buffers is 25% of the memory in your system. There are some workloads where even larger settings for shared_buffers are effective, but because PostgreSQL also relies on the operating system cache, it is unlikely that an allocation of more than 40% of RAM to shared_buffers will work better than a smaller amount. Larger settings for shared_buffers usually require a corresponding increase in max_wal_size, in order to spread out the process of writing large quantities of new or changed data over a longer period of time.
On systems with less than 1GB of RAM, a smaller percentage of RAM is appropriate, so as to leave adequate space for the operating system.
Reference: 20.4. Resource Consumption (PostgreSQL documentation)
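As a concrete illustration only (the question does not say how much RAM the machine has), on a hypothetical dedicated server with 16 GB of RAM the guidance above would translate into something like the following, run from a superuser session; shared_buffers only takes effect after a server restart:

ALTER SYSTEM SET shared_buffers = '4GB';  -- roughly 25% of the assumed 16 GB of RAM
ALTER SYSTEM SET max_wal_size = '4GB';    -- raised alongside shared_buffers for write-heavy workloads
-- restart the PostgreSQL service afterwards so the new shared_buffers value is picked up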