Post-mortem: PostgreSQL replication failure

Gon*_*uez 2 postgresql replication data-synchronization postgresql-9.4 master-slave-replication

We have a PostgreSQL 9.4.9 production server that replicates to a slave instance, but today I found that the slave is out of sync!

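The obvious course of action is to recreate the slave node and set up metrics and proper alerts for replication activity, so that we can effectively monitor the sync state between master and slave.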
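As a rough illustration of such a metric, replication lag can be expressed in bytes as the difference between two WAL positions (LSNs). The sketch below assumes the two LSN strings have already been fetched, for example with `pg_current_xlog_location()` on the master and `pg_last_xlog_replay_location()` on the slave; the function names and the 64 MB threshold are hypothetical choices, not recommendations.

```python
def lsn_to_bytes(lsn):
    """Convert an LSN such as '125D/68000000' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replication_lag_bytes(master_lsn, standby_lsn):
    """Bytes of WAL the standby still has to replay."""
    return lsn_to_bytes(master_lsn) - lsn_to_bytes(standby_lsn)


def should_alert(master_lsn, standby_lsn, threshold_bytes=64 * 1024 * 1024):
    """True when the lag exceeds the (hypothetical) alert threshold."""
    return replication_lag_bytes(master_lsn, standby_lsn) > threshold_bytes
```

On the server side, `pg_xlog_location_diff()` performs the same subtraction; doing it in the monitoring script just keeps the alerting logic in one place.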

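However, since the sync has failed, I would first like to diagnose the problem and try to determine its root cause, because this is the second time it has happened in roughly six months.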

Question: how can I diagnose what failed in the replication process, so that it can be set up in a better way this time?

Version info:

PostgreSQL 9.4.9 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit

On the slave, in /var/log/postgresql/postgresql-9.4-main.log, I can see:

2017-07-18 19:43:55 UTC [12816-1] LOG:  started streaming WAL from primary at 125D/68000000 on timeline 1
2017-07-18 19:43:55 UTC [12816-2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000125D00000068 has already been removed

2017-07-18 19:44:00 UTC [12817-1] LOG:  started streaming WAL from primary at 125D/68000000 on timeline 1
2017-07-18 19:44:00 UTC [12817-2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000125D00000068 has already been removed

2017-07-18 19:44:05 UTC [12821-1] LOG:  started streaming WAL from primary at 125D/68000000 on timeline 1
2017-07-18 19:44:05 UTC [12821-2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000125D00000068 has already been removed

2017-07-18 19:44:10 UTC [12825-1] LOG:  started streaming WAL from primary at 125D/68000000 on timeline 1
2017-07-18 19:44:10 UTC [12825-2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000125D00000068 has already been removed

2017-07-18 19:44:15 UTC [12826-1] LOG:  started streaming WAL from primary at 125D/68000000 on timeline 1
2017-07-18 19:44:15 UTC [12826-2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000125D00000068 has already been removed
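To read these messages, it can help to decode the segment file name: the 24 hex digits are the timeline, the high 32 bits of the WAL position, and the segment number within that range. The sketch below pulls the missing segment names out of a log and converts each to its starting LSN. It assumes the default 16 MB WAL segment size; the helper names are hypothetical.

```python
import re

# Assumption: default WAL segment size (16 MB); differs only on custom builds.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024

REMOVED_RE = re.compile(
    r"requested WAL segment ([0-9A-F]{24}) has already been removed"
)


def decode_segment(name, segment_size=WAL_SEGMENT_SIZE):
    """Split a WAL file name into (timeline, starting LSN)."""
    timeline = int(name[0:8], 16)
    log_id = int(name[8:16], 16)    # high 32 bits of the LSN
    seg_no = int(name[16:24], 16)   # segment index within that log id
    return timeline, "%X/%X" % (log_id, seg_no * segment_size)


def missing_segments(log_text):
    """Distinct segment names the master reported as already removed."""
    return sorted(set(REMOVED_RE.findall(log_text)))
```

For the log above, `decode_segment("000000010000125D00000068")` gives timeline 1 and LSN `125D/68000000`, the same position the slave keeps asking to stream from. Comparing that name with the oldest file still present in the master's `pg_xlog` directory shows how far behind the slave had fallen when the segment was recycled.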

New question: how can I see where the actual problem started?

Master postgresql.conf: https://pastebin.com/NJX5ku6m

Slave postgresql.conf: https://pastebin.com/CUZcyazC

Slave recovery.conf:

standby_mode = on
primary_conninfo = 'host=10.1.1.65 port=5432 user=replicador password=replicador'

Cra*_*ger 6

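Based on that, I'd say you don't have enough wal_keep_segments on the master, you aren't using a replication slot with hot_standby_feedback, and the connection dropped, or the slave fell behind, for long enough that the master removed WAL the slave still needed.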

You probably also aren't using WAL archiving (archive_command on the master, restore_command on the replica) as a fallback.

So the master removed transaction logs that the standby still needed.

You will need to re-create the standby. Then either:

  • set the standby up to use a replication slot and enable hot_standby_feedback; or

  • enable WAL archiving with archive_command and restore_command
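The checks described above can be sketched as a small script that reads the three configuration files and lists which WAL-retention safeguards are missing. This is only an illustration: the parser is deliberately naive (it treats every `#` as a comment start, so it would mangle a quoted value containing `#`), the function names are hypothetical, and the fallback values are the 9.4 defaults for settings that are not set at all.

```python
def parse_conf(text):
    """Naive postgresql.conf / recovery.conf parser: returns {name: value}."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments (simplification)
        if not line:
            continue
        if "=" in line:
            key, value = line.split("=", 1)
        else:
            parts = line.split(None, 1)       # "name value" form without "="
            if len(parts) != 2:
                continue
            key, value = parts
        settings[key.strip().lower()] = value.strip().strip("'")
    return settings


def is_on(value):
    return str(value).strip().lower() in ("on", "true", "yes", "1")


def wal_retention_gaps(master, standby, recovery):
    """List the safeguards from the answer that are not configured."""
    gaps = []
    if int(master.get("wal_keep_segments", "0")) == 0:
        gaps.append("wal_keep_segments is 0")
    if (int(master.get("max_replication_slots", "0")) == 0
            or not recovery.get("primary_slot_name")):
        gaps.append("no replication slot")
    if not is_on(standby.get("hot_standby_feedback", "off")):
        gaps.append("hot_standby_feedback is off")
    if not (is_on(master.get("archive_mode", "off"))
            and master.get("archive_command")):
        gaps.append("no WAL archiving on master")
    if not recovery.get("restore_command"):
        gaps.append("no restore_command on standby")
    return gaps
```

Fed the recovery.conf from the question, which has neither `primary_slot_name` nor `restore_command`, the function reports both the missing slot and the missing restore command regardless of what the two postgresql.conf files contain. A non-zero `wal_keep_segments` passes the first check even though it may still be too small for the write volume, so that value needs a human judgment.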