Postgresql 11 逻辑复制 - 卡在`catchup` 状态

Question

Postgresql 11 逻辑复制 - 卡在`catchup` 状态

我正在运行两台 postgresql 11 服务器 - 主服务器和从服务器（使用逻辑复制进行设置）。

我面临的问题是，经过数周不间断工作后，今天奴隶与此错误消息不同步：

2019-09-16 07:39:44.332 CEST [30117] ERROR:  could not send data to WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
2019-09-16 07:39:44.539 CEST [12932] LOG:  logical replication apply worker for subscription "logical_from_master" has started
2019-09-16 07:39:44.542 CEST [27972] LOG:  background worker "logical replication worker" (PID 30117) exited with exit code 1

Run Code Online (Sandbox Code Playgroud)

我之前确实看到过此错误消息，我的过程是wal_sender_timeout在 master上增加（有关此的更多详细信息：postgresql 中的逻辑复制-“服务器意外关闭了连接”）

那么我想恢复复制但是复制状态卡在追赶上：

2019-09-16 07:39:44.332 CEST [30117] ERROR:  could not send data to WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
2019-09-16 07:39:44.539 CEST [12932] LOG:  logical replication apply worker for subscription "logical_from_master" has started
2019-09-16 07:39:44.542 CEST [27972] LOG:  background worker "logical replication worker" (PID 30117) exited with exit code 1

Run Code Online (Sandbox Code Playgroud)

我尝试重新启动从属几次，启用和禁用订阅的不同组合 - 没有任何帮助，复制状态保持 on catchup。我可以看到sent_lsn并且write_lsn价值观正在发生变化，因此正在发送某些内容...

这是我的奴隶配置：

wal_level=logical
max_replication_slots=2
max_logical_replication_workers=4

wal_receiver_timeout=1200000

Run Code Online (Sandbox Code Playgroud)

这是我的主人：

wal_level=logical

max_replication_slots=10
max_wal_senders=10

# maximum wait time in milliseconds that the walsender process on the active master
# waits for a status message from the walreceiver process on the standby master.
wal_sender_timeout=1200000

Run Code Online (Sandbox Code Playgroud)

我不知道该怎么做（更糟糕的是，在这个阶段我不知道接下来要检查什么......）

你能帮我理解我应该怎么做才能让我的奴隶赶上来，让它回到streaming状态吗？

编辑（12 小时后）

我早上入住的时候同步还在catchup状态

master=# select * from pg_stat_replication;
  pid  | usesysid | usename | application_name  |  client_addr  | client_hostname | client_port |         backend_start         | backend_xmin |  state  |   sent_lsn   |  write_lsn   |  flush_lsn   |  replay_lsn  |    write_lag    |    flush_lag    |   replay_lag    | sync_priority | sync_state
-------+----------+---------+-------------------+---------------+-----------------+-------------+-------------------------------+--------------+---------+--------------+--------------+--------------+--------------+-----------------+-----------------+-----------------+---------------+------------
 86864 |    16680 | my_user    | logical_from_master | 10.10.10.10 |                 |       46110 | 2019-09-16 12:45:56.491325+02 |              | catchup | D55/FA04D4B8 | D55/F9E74158 | D55/F9E44CD8 | D55/F9E74030 | 00:00:03.603104 | 00:00:03.603104 | 00:00:03.603104 |             0 | async
(1 row)

Run Code Online (Sandbox Code Playgroud)

但是当我 60 秒后再次检查时，结果集是空的......

日志现在显示同一错误的多个化身：

2019-09-16 22:43:33.841 CEST [20260] ERROR:  could not receive data from WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
2019-09-16 22:43:33.959 CEST [26087] LOG:  background worker "logical replication worker" (PID 20260) exited with exit code 1
2019-09-16 22:43:34.112 CEST [3510] LOG:  logical replication apply worker for subscription "logical_from_master" has started
(...)

Run Code Online (Sandbox Code Playgroud)

为了使复制显示catchup在主服务器上，我现在必须先重新启动从服务器...

编辑（回应@LaurenzAlbe 评论）

我昨天早上重建了副本，并观察到从晚上 19:53 开始复制再次失败。主服务器和副本的日志如下：

2019-09-18 19:15:13.767 CEST [8611] LOG:  logical replication table synchronization worker for subscription "logical_replica_from_master", table "lasttable" has finished
2019-09-18 19:54:14.875 CEST [11469] ERROR:  could not send data to WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
2019-09-18 19:54:14.969 CEST [10330] LOG:  logical replication apply worker for subscription "logical_replica_from_master" has started
2019-09-18 19:54:15.031 CEST [11217] LOG:  background worker "logical replication worker" (PID 11469) exited with exit code 1

Run Code Online (Sandbox Code Playgroud)

来自master的相应日志：

2019-09-18 19:50:36.386 CEST,,,111051,,5d826e6a.1b1cb,1,,2019-09-18 19:50:34 CEST,138/28493452,0,LOG,00000,"automatic vacuum of table ""my_db.pg_toast.pg_toast_22314"": index scans: 0
pages: 0 removed, 8949 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 43798 remain, 43783 are dead but not yet removable, oldest xmin: 3141915780
buffer usage: 17925 hits, 0 misses, 0 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU: user: 0.04 s, system: 0.05 s, elapsed: 1.88 s",,,,,,,,,""
2019-09-18 19:51:36.402 CEST,,,1714,,5d826ea6.6b2,1,,2019-09-18 19:51:34 CEST,316/16529009,0,LOG,00000,"automatic vacuum of table ""my_db.pg_toast.pg_toast_22314"": index scans: 0
pages: 0 removed, 8949 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 43798 remain, 43795 are dead but not yet removable, oldest xmin: 3141915780
buffer usage: 17925 hits, 0 misses, 0 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU: user: 0.01 s, system: 0.07 s, elapsed: 1.87 s",,,,,,,,,""
2019-09-18 19:52:36.421 CEST,,,2649,,5d826ee2.a59,1,,2019-09-18 19:52:34 CEST,153/19807659,0,LOG,00000,"automatic vacuum of table ""my_db.pg_toast.pg_toast_22314"": index scans: 0
pages: 0 removed, 8949 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 43798 remain, 43795 are dead but not yet removable, oldest xmin: 3141915780
buffer usage: 17924 hits, 0 misses, 0 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU: user: 0.03 s, system: 0.05 s, elapsed: 1.87 s",,,,,,,,,""
2019-09-18 19:53:36.424 CEST,,,2945,,5d826f1e.b81,1,,2019-09-18 19:53:34 CEST,317/15405278,0,LOG,00000,"automatic vacuum of table ""my_db.pg_toast.pg_toast_22314"": index scans: 0
pages: 0 removed, 8949 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 43798 remain, 43795 are dead but not yet removable, oldest xmin: 3141915780
buffer usage: 17924 hits, 0 misses, 0 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU: user: 0.03 s, system: 0.05 s, elapsed: 1.88 s",,,,,,,,,""
2019-09-18 19:54:15.123 CEST,"core","my_db",3073,"10.194.132.16:50372",5d826f47.c01,1,"idle",2019-09-18 19:54:15 CEST,317/0,0,LOG,00000,"starting logical decoding for slot ""logical_replica_from_master""","Streaming transactions committing after D5B/7A4D40, reading WAL from D5B/7A4D40.",,,,,,,,"logical_replica_from_master"
2019-09-18 19:54:15.124 CEST,"core","my_db",3073,"10.194.132.16:50372",5d826f47.c01,2,"idle",2019-09-18 19:54:15 CEST,317/0,0,LOG,00000,"logical decoding found consistent point at D5B/7A4D40","There are no running transactions.",,,,,,,,"logical_replica_from_master"
2019-09-18 19:54:36.442 CEST,,,3152,,5d826f5a.c50,1,,2019-09-18 19:54:34 CEST,362/5175766,0,LOG,00000,"automatic vacuum of table ""my_db.pg_toast.pg_toast_22314"": index scans: 0
pages: 0 removed, 8949 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 43798 remain, 43795 are dead but not yet removable, oldest xmin: 3141915780
buffer usage: 17924 hits, 0 misses, 0 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU: user: 0.02 s, system: 0.06 s, elapsed: 1.88 s",,,,,,,,,""

Run Code Online (Sandbox Code Playgroud)

然后在午夜左右对奴隶：

2019-09-19 00:16:48.167 CEST [10330] ERROR:  could not send data to WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
2019-09-19 00:16:48.276 CEST [19530] LOG:  logical replication apply worker for subscription "logical_replica_from_master" has started
2019-09-19 00:16:48.324 CEST [11217] LOG:  background worker "logical replication worker" (PID 10330) exited with exit code 1

Run Code Online (Sandbox Code Playgroud)

和相应的登录master：

2019-09-19 00:15:41.104 CEST,,,74257,,5d82ac89.12211,1,,2019-09-19 00:15:37 CEST,78/34511468,0,LOG,00000,"automatic vacuum of table ""my_db.pg_toast.pg_toast_22314"": index scans: 0
pages: 0 removed, 13603 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 64816 remain, 64813 are dead but not yet removable, oldest xmin: 3141915780
buffer usage: 27234 hits, 0 misses, 1 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.003 MB/s
system usage: CPU: user: 0.03 s, system: 0.08 s, elapsed: 2.85 s",,,,,,,,,""
2019-09-19 00:16:13.688 CEST,,,35656,,5d382555.8b48,11190,,2019-07-24 11:31:01 CEST,,0,LOG,00000,"checkpoint complete: wrote 1748 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=174.932 s, sync=0.000 s, total=174.936 s; sync files=75, longest=0.000 s, average=0.000 s; distance=11366 kB, estimate=13499 kB",,,,,,,,,""
2019-09-19 00:16:41.121 CEST,,,75038,,5d82acc5.1251e,1,,2019-09-19 00:16:37 CEST,185/19338019,0,LOG,00000,"automatic vacuum of table ""my_db.pg_toast.pg_toast_22314"": index scans: 0
pages: 0 removed, 13603 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 64816 remain, 64813 are dead but not yet removable, oldest xmin: 3141915780
buffer usage: 27233 hits, 0 misses, 0 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU: user: 0.04 s, system: 0.07 s, elapsed: 2.85 s",,,,,,,,,""
2019-09-19 00:16:48.335 CEST,"core","my_db",75294,"10.194.132.16:50480",5d82acd0.1261e,1,"idle",2019-09-19 00:16:48 CEST,315/0,0,LOG,00000,"starting logical decoding for slot ""logical_replica_from_master""","Streaming transactions committing after D5B/1D1F1C0, reading WAL from D5B/1CA07F8.",,,,,,,,"logical_replica_from_master"
2019-09-19 00:16:48.335 CEST,"core","my_db",75294,"10.194.132.16:50480",5d82acd0.1261e,2,"idle",2019-09-19 00:16:48 CEST,315/0,0,LOG,00000,"logical decoding found consistent point at D5B/1CA07F8","There are no running transactions.",,,,,,,,"logical_replica_from_master"
2019-09-19 00:17:41.141 CEST,,,75484,,5d82ad01.126dc,1,,2019-09-19 00:17:37 CEST,330/18178915,0,LOG,00000,"automatic vacuum of table ""my_db.pg_toast.pg_toast_22314"": index scans: 0
pages: 0 removed, 13613 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 64866 remain, 64863 are dead but not yet removable, oldest xmin: 3141915780
buffer usage: 27254 hits, 0 misses, 0 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU: user: 0.04 s, system: 0.07 s, elapsed: 2.85 s",,,,,,,,,""

Run Code Online (Sandbox Code Playgroud)

编辑

以下错误消息（在 master 上）恰好在wal_sender_timeout时间之后出现（从在 slave 上启用复制开始）：

2019-09-19 13:33:58.015 CEST,"core","nzdb",112432,"10.194.132.16:50886",5d8362f5.1b730,5,"idle",2019-09-19 13:13:57 CEST,379/2076197,0,LOG,00000,"terminating walsender process due to replication timeout",,,,,"slot ""logical_replica_from_master"", output plugin ""pgoutput"", in the change callback, associated LSN D5B/6782CF0",,,"WalSndCheckTimeOut, walsender.c:2100","logical_replica_from_master"

Run Code Online (Sandbox Code Playgroud)

编辑

我向该服务器添加了更多 RAM，但观察结果仍然相同 -wal_sender_timeout在从服务器上的工作人员记录了上述错误后，在主服务器上，我只剩下以下内容pg_stat_replication：

  pid  | usesysid | usename |              application_name              |  client_addr  | client_hostname | client_port |         backend_start         | backend_xmin |  state  | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state
-------+----------+---------+--------------------------------------------+---------------+-----------------+-------------+-------------------------------+--------------+---------+----------+-----------+-----------+------------+-----------+-----------+------------+---------------+------------
 87820 |    16680 | core    | logical_replica_from_master_27004_sync_21691 | 10.10.10.10 |                 |       55548 | 2019-09-19 15:31:40.032662+02 |   3142872730 | startup |          |           |           |            |           |           |            |             0 | async
(1 row)

Run Code Online (Sandbox Code Playgroud)

然后在很长一段时间后又回到了追赶状态，但情况有所不同 sent_lsn

如果我跑到INSERTmaster 上的测试表，那么我看不到 slave 上的变化。

Answer 1

小智 2

正如您已经发现的，您的主数据库太繁忙，单个复制工作人员无法处理所有更改。

您需要对表进行集群 - 但请确保以这样的方式进行操作：具有外键的表由同一工作人员处理，否则您可能会遇到外键约束将阻止数据插入到一个表中的情况，因为表外键关键点尚未更新。

归档时间：	6 年，1 月前
查看次数：	3112 次
最近记录：	6 年，1 月前