Why is the subscription status down when using pglogical on AWS Aurora?

Zha*_* Yi 7 postgresql amazon-web-services amazon-aurora

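I have deployed two Aurora PostgreSQL clusters in AWS, versions 11.9 and 13.4. I followed the instructions at https://aws.amazon.com/blogs/database/part-2-upgrade-your-amazon-rds-for-postgresql-database-using-the-pgological-extension/ to set up replication from the version 11 cluster to the version 13 cluster.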

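The nodes, replication sets, and subscription were all created successfully in both clusters. However, when I check the subscription status in the target database, the status is down: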

=> select subscription_name, slot_name, status from pglogical.show_subscription_status();
 subscription_name |                   slot_name                   | status
-------------------+-----------------------------------------------+--------
 ams_subscription1 | pgl_____ngine_ams_mast2d01c59_ams_subsc050888 | down
(1 row)

In the target database log, I can see the following errors:


2022-03-01 06:34:29 UTC::@:[16329]:LOG: background worker "pglogical apply 16400:3226503298" (PID 29403) exited with exit code 1
2022-03-01 06:35:49 UTC::@:[16329]:LOG: background worker "pglogical apply 16400:3226503298" (PID 1453) exited with exit code 1
2022-03-01 06:38:29 UTC::@:[16329]:LOG: background worker "pglogical apply 16400:3226503298" (PID 10318) exited with exit code 1
----------------------- END OF LOG ----------------------

In the source database log:

2022-03-01 06:34:29 UTC:10.74.105.225(33688):amsMasterUser@AMSEngine:[26786]:ERROR: replication origin "pgl_____ngine_ams_mast2d01c59_ams_subsc050888" does not exist
2022-03-01 06:34:29 UTC:10.74.105.225(33688):amsMasterUser@AMSEngine:[26786]:STATEMENT: SELECT pg_catalog.pg_replication_origin_session_setup('pgl_____ngine_ams_mast2d01c59_ams_subsc050888');
BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
SET session_replication_role = 'replica';
SET DATESTYLE = ISO;
SET INTERVALSTYLE = POSTGRES;
SET extra_float_digits TO 3;
SET statement_timeout = 0;
SET lock_timeout = 0;

2022-03-01 06:34:29 UTC:10.74.105.225(33688):amsMasterUser@AMSEngine:[26786]:LOG: could not receive data from client: Connection reset by peer
2022-03-01 06:34:29 UTC:10.74.105.225(33684):amsMasterUser@AMSEngine:[26784]:LOG: could not receive data from client: Connection reset by peer
2022-03-01 06:34:29 UTC:10.74.105.225(33684):amsMasterUser@AMSEngine:[26784]:LOG: unexpected EOF on client connection with an open transaction
2022-03-01 06:34:29 UTC:10.74.105.225(33686):amsMasterUser@AMSEngine:[26785]:LOG: could not receive data from client: Connection reset by peer
2022-03-01 06:34:29 UTC:10.74.105.225(33686):amsMasterUser@AMSEngine:[26785]:LOG: unexpected EOF on client connection with an open transaction
----------------------- END OF LOG ----------------------

I can see errors on both the source and the target instances, but I cannot figure out what the problem is.

SRJ*_*SRJ 0

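Root cause

This is a very strange state. I could not pin down the underlying root cause, but I managed to fix the problem with the steps below.

In our case, the trigger was shutting down the source and target databases, which somehow left replication in this out-of-sync state.

Note: if your subscription status is replicating but the source and target are out of sync, you only need to run the command given under "Additional note"; there is no need to apply the resolution.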

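Resolution

If your subscription status is down:

  • Drop the pglogical extension.
  • Create the pglogical configuration again for both your publisher and subscriber databases.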
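
The two steps above could look roughly like the following. This is only a sketch under assumptions: the node names (provider1, subscriber1), the DSN placeholders in angle brackets, and the public schema are hypothetical and not taken from the original setup; only the subscription name comes from the question. It also drops the subscription and nodes explicitly before dropping the extension, so that pglogical gets a chance to clean up its replication slot on the source.

```sql
-- Sketch only: node names, DSNs, and the 'public' schema are placeholders.

-- 1. On the subscriber (target) database: remove the old configuration.
SELECT pglogical.drop_subscription('ams_subscription1', true);  -- true = ifexists
SELECT pglogical.drop_node('subscriber1', true);                -- true = ifexists
DROP EXTENSION pglogical;

-- 2. On the provider (source) database: remove the old configuration.
SELECT pglogical.drop_node('provider1', true);
DROP EXTENSION pglogical;

-- Optional check on the provider: list any leftover pglogical slots.
SELECT slot_name, active FROM pg_replication_slots WHERE slot_name LIKE 'pgl%';

-- 3. On the provider: recreate the extension, the node, and the table list.
CREATE EXTENSION pglogical;
SELECT pglogical.create_node(
    node_name := 'provider1',
    dsn := 'host=<source-endpoint> port=5432 dbname=<dbname> user=<user> password=<password>'
);
SELECT pglogical.replication_set_add_all_tables('default', ARRAY['public']);

-- 4. On the subscriber: recreate the extension, the node, and the subscription.
CREATE EXTENSION pglogical;
SELECT pglogical.create_node(
    node_name := 'subscriber1',
    dsn := 'host=<target-endpoint> port=5432 dbname=<dbname> user=<user> password=<password>'
);
SELECT pglogical.create_subscription(
    subscription_name := 'ams_subscription1',
    provider_dsn := 'host=<source-endpoint> port=5432 dbname=<dbname> user=<user> password=<password>',
    replication_sets := ARRAY['default']
);

-- 5. On the subscriber: confirm the status is no longer down.
SELECT subscription_name, slot_name, status
FROM pglogical.show_subscription_status();
```

If the check in step 2 still lists a slot from the old subscription, it will keep retaining WAL on the source until it is removed, so it is worth looking at before recreating anything.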

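Additional note

The last step brings the subscription back to a healthy replicating state for new entries, but some lag may still remain.

So, to make sure the source and target tables are in sync, run the following command: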

SELECT pglogical.alter_subscription_resynchronize_table('${SUBSCRIPTION_NAME}', '${SCHEMA_NAME}.${TABLE_NAME}');

After a few minutes, the data in both tables was in sync.
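
One way to check on the resynchronization, rather than simply waiting, might be the queries below. This is a sketch: public.my_table is a hypothetical table name, and the subscription name is the one from the question. Also note that, as far as I know from the pglogical documentation, alter_subscription_resynchronize_table truncates the target table before copying it again, so the table can look empty while the sync is running.

```sql
-- Run on the subscriber (target). 'public.my_table' is a placeholder.

-- Per-table sync state as tracked by pglogical.
SELECT *
FROM pglogical.show_subscription_table('ams_subscription1', 'public.my_table');

-- Overall subscription state.
SELECT subscription_name, status
FROM pglogical.show_subscription_status('ams_subscription1');

-- Rough sanity check: run this on both source and target and compare.
SELECT count(*) FROM public.my_table;
```

Matching row counts are only a coarse signal; on a table that is still receiving writes the two numbers can differ briefly even when replication is healthy.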