Gre*_*0ry 6 postgresql replication
我正在运行两台 postgresql 11 服务器 - 主服务器和从服务器(使用逻辑复制进行设置)。
我面临的问题是,经过数周不间断工作后,今天奴隶与此错误消息不同步:
2019-09-16 07:39:44.332 CEST [30117] ERROR: could not send data to WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2019-09-16 07:39:44.539 CEST [12932] LOG: logical replication apply worker for subscription "logical_from_master" has started
2019-09-16 07:39:44.542 CEST [27972] LOG: background worker "logical replication worker" (PID 30117) exited with exit code 1
Run Code Online (Sandbox Code Playgroud)
我之前确实看到过此错误消息,我的过程是wal_sender_timeout在 master上增加(有关此的更多详细信息:postgresql 中的逻辑复制-“服务器意外关闭了连接”)
那么我想恢复复制但是复制状态卡在追赶上:
2019-09-16 07:39:44.332 CEST [30117] ERROR: could not send data to WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2019-09-16 07:39:44.539 CEST [12932] LOG: logical replication apply worker for subscription "logical_from_master" has started
2019-09-16 07:39:44.542 CEST [27972] LOG: background worker "logical replication worker" (PID 30117) exited with exit code 1
Run Code Online (Sandbox Code Playgroud)
我尝试重新启动从属几次,启用和禁用订阅的不同组合 - 没有任何帮助,复制状态保持 on catchup。我可以看到sent_lsn并且write_lsn价值观正在发生变化,因此正在发送某些内容...
这是我的奴隶配置:
wal_level=logical
max_replication_slots=2
max_logical_replication_workers=4
wal_receiver_timeout=1200000
Run Code Online (Sandbox Code Playgroud)
这是我的主人:
wal_level=logical
max_replication_slots=10
max_wal_senders=10
# maximum wait time in milliseconds that the walsender process on the active master
# waits for a status message from the walreceiver process on the standby master.
wal_sender_timeout=1200000
Run Code Online (Sandbox Code Playgroud)
我不知道该怎么做(更糟糕的是,在这个阶段我不知道接下来要检查什么......)
你能帮我理解我应该怎么做才能让我的奴隶赶上来,让它回到streaming状态吗?
编辑(12 小时后)
我早上入住的时候同步还在catchup状态
master=# select * from pg_stat_replication;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state
-------+----------+---------+-------------------+---------------+-----------------+-------------+-------------------------------+--------------+---------+--------------+--------------+--------------+--------------+-----------------+-----------------+-----------------+---------------+------------
86864 | 16680 | my_user | logical_from_master | 10.10.10.10 | | 46110 | 2019-09-16 12:45:56.491325+02 | | catchup | D55/FA04D4B8 | D55/F9E74158 | D55/F9E44CD8 | D55/F9E74030 | 00:00:03.603104 | 00:00:03.603104 | 00:00:03.603104 | 0 | async
(1 row)
Run Code Online (Sandbox Code Playgroud)
但是当我 60 秒后再次检查时,结果集是空的......
日志现在显示同一错误的多个化身:
2019-09-16 22:43:33.841 CEST [20260] ERROR: could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2019-09-16 22:43:33.959 CEST [26087] LOG: background worker "logical replication worker" (PID 20260) exited with exit code 1
2019-09-16 22:43:34.112 CEST [3510] LOG: logical replication apply worker for subscription "logical_from_master" has started
(...)
Run Code Online (Sandbox Code Playgroud)
为了使复制显示catchup在主服务器上,我现在必须先重新启动从服务器...
编辑(回应@LaurenzAlbe 评论)
我昨天早上重建了副本,并观察到从晚上 19:53 开始复制再次失败。主服务器和副本的日志如下:
2019-09-18 19:15:13.767 CEST [8611] LOG: logical replication table synchronization worker for subscription "logical_replica_from_master", table "lasttable" has finished
2019-09-18 19:54:14.875 CEST [11469] ERROR: could not send data to WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2019-09-18 19:54:14.969 CEST [10330] LOG: logical replication apply worker for subscription "logical_replica_from_master" has started
2019-09-18 19:54:15.031 CEST [11217] LOG: background worker "logical replication worker" (PID 11469) exited with exit code 1
Run Code Online (Sandbox Code Playgroud)
来自master的相应日志:
2019-09-18 19:50:36.386 CEST,,,111051,,5d826e6a.1b1cb,1,,2019-09-18 19:50:34 CEST,138/28493452,0,LOG,00000,"automatic vacuum of table ""my_db.pg_toast.pg_toast_22314"": index scans: 0
pages: 0 removed, 8949 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 43798 remain, 43783 are dead but not yet removable, oldest xmin: 3141915780
buffer usage: 17925 hits, 0 misses, 0 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU: user: 0.04 s, system: 0.05 s, elapsed: 1.88 s",,,,,,,,,""
2019-09-18 19:51:36.402 CEST,,,1714,,5d826ea6.6b2,1,,2019-09-18 19:51:34 CEST,316/16529009,0,LOG,00000,"automatic vacuum of table ""my_db.pg_toast.pg_toast_22314"": index scans: 0
pages: 0 removed, 8949 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 43798 remain, 43795 are dead but not yet removable, oldest xmin: 3141915780
buffer usage: 17925 hits, 0 misses, 0 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU: user: 0.01 s, system: 0.07 s, elapsed: 1.87 s",,,,,,,,,""
2019-09-18 19:52:36.421 CEST,,,2649,,5d826ee2.a59,1,,2019-09-18 19:52:34 CEST,153/19807659,0,LOG,00000,"automatic vacuum of table ""my_db.pg_toast.pg_toast_22314"": index scans: 0
pages: 0 removed, 8949 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 43798 remain, 43795 are dead but not yet removable, oldest xmin: 3141915780
buffer usage: 17924 hits, 0 misses, 0 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU: user: 0.03 s, system: 0.05 s, elapsed: 1.87 s",,,,,,,,,""
2019-09-18 19:53:36.424 CEST,,,2945,,5d826f1e.b81,1,,2019-09-18 19:53:34 CEST,317/15405278,0,LOG,00000,"automatic vacuum of table ""my_db.pg_toast.pg_toast_22314"": index scans: 0
pages: 0 removed, 8949 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 43798 remain, 43795 are dead but not yet removable, oldest xmin: 3141915780
buffer usage: 17924 hits, 0 misses, 0 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU: user: 0.03 s, system: 0.05 s, elapsed: 1.88 s",,,,,,,,,""
2019-09-18 19:54:15.123 CEST,"core","my_db",3073,"10.194.132.16:50372",5d826f47.c01,1,"idle",2019-09-18 19:54:15 CEST,317/0,0,LOG,00000,"starting logical decoding for slot ""logical_replica_from_master""","Streaming transactions committing after D5B/7A4D40, reading WAL from D5B/7A4D40.",,,,,,,,"logical_replica_from_master"
2019-09-18 19:54:15.124 CEST,"core","my_db",3073,"10.194.132.16:50372",5d826f47.c01,2,"idle",2019-09-18 19:54:15 CEST,317/0,0,LOG,00000,"logical decoding found consistent point at D5B/7A4D40","There are no running transactions.",,,,,,,,"logical_replica_from_master"
2019-09-18 19:54:36.442 CEST,,,3152,,5d826f5a.c50,1,,2019-09-18 19:54:34 CEST,362/5175766,0,LOG,00000,"automatic vacuum of table ""my_db.pg_toast.pg_toast_22314"": index scans: 0
pages: 0 removed, 8949 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 43798 remain, 43795 are dead but not yet removable, oldest xmin: 3141915780
buffer usage: 17924 hits, 0 misses, 0 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU: user: 0.02 s, system: 0.06 s, elapsed: 1.88 s",,,,,,,,,""
Run Code Online (Sandbox Code Playgroud)
然后在午夜左右对奴隶:
2019-09-19 00:16:48.167 CEST [10330] ERROR: could not send data to WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2019-09-19 00:16:48.276 CEST [19530] LOG: logical replication apply worker for subscription "logical_replica_from_master" has started
2019-09-19 00:16:48.324 CEST [11217] LOG: background worker "logical replication worker" (PID 10330) exited with exit code 1
Run Code Online (Sandbox Code Playgroud)
和相应的登录master:
2019-09-19 00:15:41.104 CEST,,,74257,,5d82ac89.12211,1,,2019-09-19 00:15:37 CEST,78/34511468,0,LOG,00000,"automatic vacuum of table ""my_db.pg_toast.pg_toast_22314"": index scans: 0
pages: 0 removed, 13603 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 64816 remain, 64813 are dead but not yet removable, oldest xmin: 3141915780
buffer usage: 27234 hits, 0 misses, 1 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.003 MB/s
system usage: CPU: user: 0.03 s, system: 0.08 s, elapsed: 2.85 s",,,,,,,,,""
2019-09-19 00:16:13.688 CEST,,,35656,,5d382555.8b48,11190,,2019-07-24 11:31:01 CEST,,0,LOG,00000,"checkpoint complete: wrote 1748 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=174.932 s, sync=0.000 s, total=174.936 s; sync files=75, longest=0.000 s, average=0.000 s; distance=11366 kB, estimate=13499 kB",,,,,,,,,""
2019-09-19 00:16:41.121 CEST,,,75038,,5d82acc5.1251e,1,,2019-09-19 00:16:37 CEST,185/19338019,0,LOG,00000,"automatic vacuum of table ""my_db.pg_toast.pg_toast_22314"": index scans: 0
pages: 0 removed, 13603 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 64816 remain, 64813 are dead but not yet removable, oldest xmin: 3141915780
buffer usage: 27233 hits, 0 misses, 0 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU: user: 0.04 s, system: 0.07 s, elapsed: 2.85 s",,,,,,,,,""
2019-09-19 00:16:48.335 CEST,"core","my_db",75294,"10.194.132.16:50480",5d82acd0.1261e,1,"idle",2019-09-19 00:16:48 CEST,315/0,0,LOG,00000,"starting logical decoding for slot ""logical_replica_from_master""","Streaming transactions committing after D5B/1D1F1C0, reading WAL from D5B/1CA07F8.",,,,,,,,"logical_replica_from_master"
2019-09-19 00:16:48.335 CEST,"core","my_db",75294,"10.194.132.16:50480",5d82acd0.1261e,2,"idle",2019-09-19 00:16:48 CEST,315/0,0,LOG,00000,"logical decoding found consistent point at D5B/1CA07F8","There are no running transactions.",,,,,,,,"logical_replica_from_master"
2019-09-19 00:17:41.141 CEST,,,75484,,5d82ad01.126dc,1,,2019-09-19 00:17:37 CEST,330/18178915,0,LOG,00000,"automatic vacuum of table ""my_db.pg_toast.pg_toast_22314"": index scans: 0
pages: 0 removed, 13613 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 64866 remain, 64863 are dead but not yet removable, oldest xmin: 3141915780
buffer usage: 27254 hits, 0 misses, 0 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU: user: 0.04 s, system: 0.07 s, elapsed: 2.85 s",,,,,,,,,""
Run Code Online (Sandbox Code Playgroud)
编辑
以下错误消息(在 master 上)恰好在wal_sender_timeout时间之后出现(从在 slave 上启用复制开始):
2019-09-19 13:33:58.015 CEST,"core","nzdb",112432,"10.194.132.16:50886",5d8362f5.1b730,5,"idle",2019-09-19 13:13:57 CEST,379/2076197,0,LOG,00000,"terminating walsender process due to replication timeout",,,,,"slot ""logical_replica_from_master"", output plugin ""pgoutput"", in the change callback, associated LSN D5B/6782CF0",,,"WalSndCheckTimeOut, walsender.c:2100","logical_replica_from_master"
Run Code Online (Sandbox Code Playgroud)
编辑
我向该服务器添加了更多 RAM,但观察结果仍然相同 -wal_sender_timeout在从服务器上的工作人员记录了上述错误后,在主服务器上,我只剩下以下内容pg_stat_replication:
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state
-------+----------+---------+--------------------------------------------+---------------+-----------------+-------------+-------------------------------+--------------+---------+----------+-----------+-----------+------------+-----------+-----------+------------+---------------+------------
87820 | 16680 | core | logical_replica_from_master_27004_sync_21691 | 10.10.10.10 | | 55548 | 2019-09-19 15:31:40.032662+02 | 3142872730 | startup | | | | | | | | 0 | async
(1 row)
Run Code Online (Sandbox Code Playgroud)
然后在很长一段时间后又回到了追赶状态,但情况有所不同 sent_lsn
如果我跑到INSERTmaster 上的测试表,那么我看不到 slave 上的变化。
小智 2
正如您已经发现的,您的主数据库太繁忙,单个复制工作人员无法处理所有更改。
您需要对表进行集群 - 但请确保以这样的方式进行操作:具有外键的表由同一工作人员处理,否则您可能会遇到外键约束将阻止数据插入到一个表中的情况,因为表外键关键点尚未更新。
| 归档时间: |
|
| 查看次数: |
3112 次 |
| 最近记录: |