Postgres 陷入“致命:数据库系统正在启动”长达数小时之久。我们应该等待还是这是数据库损坏的迹象?

Ale*_*lex 8 postgresql

我找到了这个问题的答案,表明重新启动 postgres 可能需要一段时间。但现在我们已经等了2个多小时,与通常需要的5分钟相比,这显得非常漫长。(/sf/ask/3844570341/,/sf/ask/3844570341/
​-数据库系统正在启动-windows-10 )

除了 之外没有其他日志FATAL: the database system is starting up。此外,CPU 和 RAM 利用率极低。看起来好像什么也没发生。

我们还检查了磁盘空间,没有问题。

我们应该等多久才能尝试其他事情?您知道什么可能导致此问题以及如何解决它吗?

一些附加信息:

  • Postgres 版本是 13.8
  • 数据库大小约为 1.5TB
  • 如果我们的监控工作正常,最后一个检查点是在关闭前 20 分钟左右
  • 沃尔设置:
max_wal_size = 8GB
min_wal_size = 512MB
checkpoint_completion_target = 0.9
Run Code Online (Sandbox Code Playgroud)

使用该-d 3标志启动 postgres 后,我们得到以下日志:

 2022-12-02 15:17:10.471 UTC [1] LOG:  starting PostgreSQL 13.8 (Ubuntu 13.8-1.pgdg22.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.2.0-19ubuntu1) 11.2.0, 64-bit
 2022-12-02 15:17:10.472 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
 2022-12-02 15:17:10.472 UTC [1] LOG:  listening on IPv6 address "::", port 5432
 2022-12-02 15:17:10.480 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
 2022-12-02 15:17:10.683 UTC [14] LOG:  database system was interrupted; last known up at 2022-12-02 10:01:04 UTC
 2022-12-02 15:17:10.683 UTC [14] DEBUG:  removing all temporary WAL segments
 2022-12-02 15:17:10.914 UTC [1] DEBUG:  forked new backend, pid=15 socket=9
 2022-12-02 15:17:10.915 UTC [15] LOG:  connection received: host=10.255.0.181 port=46576
 2022-12-02 15:17:10.918 UTC [15] FATAL:  the database system is starting up
 2022-12-02 15:17:10.918 UTC [15] DEBUG:  shmem_exit(1): 0 before_shmem_exit callbacks to make
 2022-12-02 15:17:10.918 UTC [15] DEBUG:  shmem_exit(1): 0 on_shmem_exit callbacks to make
 2022-12-02 15:17:10.918 UTC [15] DEBUG:  proc_exit(1): 1 callbacks to make
 2022-12-02 15:17:10.918 UTC [15] DEBUG:  exit(1)
 2022-12-02 15:17:10.918 UTC [15] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
 2022-12-02 15:17:10.918 UTC [15] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
 2022-12-02 15:17:10.918 UTC [15] DEBUG:  proc_exit(-1): 0 callbacks to make
 2022-12-02 15:17:10.920 UTC [1] DEBUG:  server process (PID 15) exited with exit code 1
 2022-12-02 15:17:11.157 UTC [1] DEBUG:  forked new backend, pid=16 socket=9
 2022-12-02 15:17:11.157 UTC [16] LOG:  connection received: host=10.255.0.181 port=56142
 2022-12-02 15:17:11.158 UTC [16] FATAL:  the database system is starting up
 2022-12-02 15:17:11.158 UTC [16] DEBUG:  shmem_exit(1): 0 before_shmem_exit callbacks to make
 2022-12-02 15:17:11.158 UTC [16] DEBUG:  shmem_exit(1): 0 on_shmem_exit callbacks to make
 2022-12-02 15:17:11.158 UTC [16] DEBUG:  proc_exit(1): 1 callbacks to make
 2022-12-02 15:17:11.158 UTC [16] DEBUG:  exit(1)
 2022-12-02 15:17:11.158 UTC [16] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
 2022-12-02 15:17:11.158 UTC [16] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
 2022-12-02 15:17:11.158 UTC [16] DEBUG:  proc_exit(-1): 0 callbacks to make
 2022-12-02 15:17:11.160 UTC [1] DEBUG:  server process (PID 16) exited with exit code 1
Run Code Online (Sandbox Code Playgroud)

这会无限地重复。

按照 Laurenz 的建议使用 strace 给了我们这个:

[02.12.2022, 5:25:50,463 PM] openat(AT_FDCWD, "/usr/share/locale/C.UTF-8/LC_MESSAGES/pg_ctl-13.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
[02.12.2022, 5:25:50,463 PM] openat(AT_FDCWD, "/usr/share/locale/C.utf8/LC_MESSAGES/pg_ctl-13.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
[02.12.2022, 5:25:50,463 PM] openat(AT_FDCWD, "/usr/share/locale/C/LC_MESSAGES/pg_ctl-13.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
[02.12.2022, 5:25:50,463 PM] openat(AT_FDCWD, "/usr/share/locale-langpack/C.UTF-8/LC_MESSAGES/pg_ctl-13.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
[02.12.2022, 5:25:50,463 PM] openat(AT_FDCWD, "/usr/share/locale-langpack/C.utf8/LC_MESSAGES/pg_ctl-13.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
[02.12.2022, 5:25:50,463 PM] openat(AT_FDCWD, "/usr/share/locale-langpack/C/LC_MESSAGES/pg_ctl-13.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
[02.12.2022, 5:25:50,464 PM] newfstatat(1, "", {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0), ...}, AT_EMPTY_PATH) = 0
[02.12.2022, 5:25:50,465 PM] write(1, "waiting for server to start...", 30waiting for server to start...) = 30
[02.12.2022, 5:25:50,466 PM] openat(AT_FDCWD, "/home/postgres/pgdata/data/postmaster.pid", O_RDONLY) = -1 ENOENT (No such file or directory)
[02.12.2022, 5:25:50,466 PM] wait4(63, 0x7ffcc7ccd680, WNOHANG, NULL) = 0
[02.12.2022, 5:25:50,466 PM] write(1, ".", 1.)                        = 1
[02.12.2022, 5:25:50,466 PM] pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=100000000}, NULL) = 0 (Timeout)
[02.12.2022, 5:25:50,515 PM] openat(AT_FDCWD, "/home/postgres/pgdata/data/postmaster.pid", O_RDONLY) = 3
[02.12.2022, 5:25:50,515 PM] newfstatat(3, "", {st_mode=S_IFREG|0600, st_size=68, ...}, AT_EMPTY_PATH) = 0
[02.12.2022, 5:25:50,516 PM] read(3, "63\n/home/postgres/pgdata/data\n16"..., 69) = 68
[02.12.2022, 5:25:50,517 PM] close(3)                                = 0
[02.12.2022, 5:25:50,518 PM] wait4(63, 0x7ffcc7ccd680, WNOHANG, NULL) = 0
[02.12.2022, 5:25:50,519 PM] pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=100000000}, NULL2022-12-02 16:25:50.604 UTC [63] LOG:  starting PostgreSQL 13.8 (Ubuntu 13.8-1.pgdg22.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.2.0-19ubuntu1) 11.2.0, 64-bit
[02.12.2022, 5:25:50,540 PM] 2022-12-02 16:25:50.612 UTC [63] LOG:  listening on IPv4 address "0.0.0.0", port 5432
[02.12.2022, 5:25:50,540 PM] 2022-12-02 16:25:50.612 UTC [63] LOG:  listening on IPv6 address "::", port 5432
[02.12.2022, 5:25:50,546 PM] 2022-12-02 16:25:50.617 UTC [63] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
[02.12.2022, 5:25:50,619 PM] ) = 0 (Timeout)
[02.12.2022, 5:25:50,619 PM] openat(AT_FDCWD, "/home/postgres/pgdata/data/postmaster.pid", O_RDONLY) = 3
[02.12.2022, 5:25:50,619 PM] newfstatat(3, "", {st_mode=S_IFREG|0600, st_size=88, ...}, AT_EMPTY_PATH) = 0
[02.12.2022, 5:25:50,620 PM] read(3, "63\n/home/postgres/pgdata/data\n16"..., 89) = 88
[02.12.2022, 5:25:50,621 PM] close(3)                                = 0
[02.12.2022, 5:25:50,621 PM] wait4(63, 0x7ffcc7ccd680, WNOHANG, NULL) = 0
[02.12.2022, 5:25:50,621 PM] pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=100000000}, NULL) = 0 (Timeout)
[02.12.2022, 5:25:50,716 PM] openat(AT_FDCWD, "/home/postgres/pgdata/data/postmaster.pid", O_RDONLY) = 3
[02.12.2022, 5:25:50,716 PM] newfstatat(3, "", {st_mode=S_IFREG|0600, st_size=88, ...}, AT_EMPTY_PATH) = 0
[02.12.2022, 5:25:50,716 PM] read(3, "63\n/home/postgres/pgdata/data\n16"..., 89) = 88
[02.12.2022, 5:25:50,716 PM] close(3)                                = 0
[02.12.2022, 5:25:50,717 PM] wait4(63, 0x7ffcc7ccd680, WNOHANG, NULL) = 0
[02.12.2022, 5:25:50,717 PM] pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=100000000}, NULL) = 0 (Timeout)
[02.12.2022, 5:25:50,817 PM] openat(AT_FDCWD, "/home/postgres/pgdata/data/postmaster.pid", O_RDONLY) = 3
[02.12.2022, 5:25:50,819 PM] newfstatat(3, "", {st_mode=S_IFREG|0600, st_size=88, ...}, AT_EMPTY_PATH) = 0
[02.12.2022, 5:25:50,820 PM] read(3, "63\n/home/postgres/pgdata/data\n16"..., 89) = 88
[02.12.2022, 5:25:50,824 PM] close(3)                                = 0
[02.12.2022, 5:25:50,824 PM] wait4(63, 0x7ffcc7ccd680, WNOHANG, NULL) = 0
[02.12.2022, 5:25:50,824 PM] pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=100000000}, NULL) = 0 (Timeout)
[02.12.2022, 5:25:50,918 PM] openat(AT_FDCWD, "/home/postgres/pgdata/data/postmaster.pid", O_RDONLY) = 3
[02.12.2022, 5:25:50,919 PM] newfstatat(3, "", {st_mode=S_IFREG|0600, st_size=88, ...}, AT_EMPTY_PATH) = 0
[02.12.2022, 5:25:50,919 PM] read(3, "63\n/home/postgres/pgdata/data\n16"..., 89) = 88
[02.12.2022, 5:25:50,919 PM] close(3)                                = 0
[02.12.2022, 5:25:50,919 PM] wait4(63, 0x7ffcc7ccd680, WNOHANG, NULL) = 0
[02.12.2022, 5:25:50,919 PM] pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=100000000}, NULL) = 0 (Timeout)
[02.12.2022, 5:25:51,019 PM] openat(AT_FDCWD, "/home/postgres/pgdata/data/postmaster.pid", O_RDONLY) = 3
[02.12.2022, 5:25:51,019 PM] newfstatat(3, "", {st_mode=S_IFREG|0600, st_size=88, ...}, AT_EMPTY_PATH) = 0
[02.12.2022, 5:25:51,019 PM] read(3, "63\n/home/postgres/pgdata/data\n16"..., 89) = 88
[02.12.2022, 5:25:51,020 PM] close(3)                                = 0
[02.12.2022, 5:25:51,020 PM] wait4(63, 0x7ffcc7ccd680, WNOHANG, NULL) = 0
[02.12.2022, 5:25:51,020 PM] pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=100000000}, NULL) = 0 (Timeout)
[02.12.2022, 5:25:51,120 PM] openat(AT_FDCWD, "/home/postgres/pgdata/data/postmaster.pid", O_RDONLY) = 3
[02.12.2022, 5:25:51,120 PM] newfstatat(3, "", {st_mode=S_IFREG|0600, st_size=88, ...}, AT_EMPTY_PATH) = 0
[02.12.2022, 5:25:51,120 PM] read(3, "63\n/home/postgres/pgdata/data\n16"..., 89) = 88
Run Code Online (Sandbox Code Playgroud)

所以这是重复的

[02.12.2022, 5:25:50,919 PM] pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=100000000}, NULL) = 0 (Timeout)
[02.12.2022, 5:25:51,019 PM] openat(AT_FDCWD, "/home/postgres/pgdata/data/postmaster.pid", O_RDONLY) = 3
[02.12.2022, 5:25:51,019 PM] newfstatat(3, "", {st_mode=S_IFREG|0600, st_size=88, ...}, AT_EMPTY_PATH) = 0
[02.12.2022, 5:25:51,019 PM] read(3, "63\n/home/postgres/pgdata/data\n16"..., 89) = 88
[02.12.2022, 5:25:51,020 PM] close(3) 
Run Code Online (Sandbox Code Playgroud)

Ale*_*lex 5

数据库再次运行。我们并不完全确定原因,但这就是我们所做的:

我们发现,当 Postgres 启动时,我们得到了很多

2022-12-02 15:17:10.918 UTC [15] FATAL:  the database system is starting up
Run Code Online (Sandbox Code Playgroud)

这是响应大量尝试访问数据库的客户端请求的错误消息。我们在另一个端口上启动数据库,以便它可以在所有客户端都尝试访问它的情况下启动。由于某种原因,这解决了这个问题(至少我们这么认为)。数据库正常启动,再次关闭并在原始端口上重新启动后,它又恢复运行。

一次可怕的经历...