Starting the PostgreSQL server after an HDD crash results in FAILED STATE

Question

I am using Fedora 15 with PostgreSQL 9.1.4. Fedora crashed recently, and since then:

Attempting to start the PostgreSQL server with:

service postgresql-9.1 start

gives:

Starting postgresql-9.1 (via systemctl):  Job failed. See system logs and 'systemctl status' for details.
                                                       [FAILED]

However, the first time I started the server after the system reboot, it started normally.
But attempting to use psql gives this error:

psql: could not connect to server: No such file or directory
    Is the server running locally and accepting
    connections on Unix domain socket "/tmp/.s.PGSQL.5432"?

The .s.PGSQL.5432 file does not exist anywhere on the system. Running locate .s.PGSQL.5432 outputs nothing.

The system log shows:

Aug 14 17:31:58 localhost systemd[1]: postgresql-9.1.service: control process exited, code=exited status=1
Aug 14 17:31:58 localhost systemd[1]: Unit postgresql-9.1.service entered failed state.

Running

systemctl status postgresql-9.1.service

gives:

postgresql-9.1.service - SYSV: PostgreSQL database server.
          Loaded: loaded (/etc/rc.d/init.d/postgresql-9.1)
      Active: failed since Tue, 14 Aug 2012 17:31:58 +0530; 58s ago
     Process: 2811 ExecStop=/etc/rc.d/init.d/postgresql-9.1 stop (code=exited, status=1/FAILURE)
     Process: 12423 ExecStart=/etc/rc.d/init.d/postgresql-9.1 start (code=exited, status=1/FAILURE)
    Main PID: 2551 (code=exited, status=1/FAILURE)
      CGroup: name=systemd:/system/postgresql-9.1.service

I have not changed the default fsync setting, so I assume it is set to on. The database is on an HDD, and that HDD crashed.

HDD crash

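The HDD crash left me running fsck manually at a prompt rather than through a GUI. It repaired a very large number of inodes and other structures. After that, I restarted the system with Ctrl+Alt+Delete.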

The PostgreSQL log shows:

LOG:  database system was interrupted; last known up at 2012-08-14 17:31:57 IST
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  record with zero length at 0/41A4E58
LOG:  redo is not required
FATAL:  could not access status of transaction 1
DETAIL:  Could not open file "pg_multixact/offsets/0000": No such file or directory.
LOG:  startup process (PID 13016) exited with exit code 1
LOG:  aborting startup due to startup process failure

Update

Attempting to start the server after taking a filesystem-level copy of the /var/lib/pgsql directory and running ./pg_resetxlog -f /var/lib/pgsql/9.1/data/ still yields:

LOG:  database system was interrupted; last known up at 2012-08-14 18:46:36 IST
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  record with zero length at 0/6000078
LOG:  redo is not required
FATAL:  could not access status of transaction 1
DETAIL:  Could not open file "pg_multixact/offsets/0000": No such file or directory.
LOG:  startup process (PID 13766) exited with exit code 1
LOG:  aborting startup due to startup process failure

Answer 1

Cra*_*ger 15

The real answer will be in the PostgreSQL logs, in /var/lib/pgsql/data/pg_log.

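Before you take any action, though: if any of your data is of value to you, it is vital that you take a filesystem-level copy of the database before attempting any repair. See http://wiki.postgresql.org/wiki/Corruption. You must copy the entire data directory. On Fedora this is /var/lib/pgsql/data by default, but verify that this is correct for your installation.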

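Based on the logs you posted, you definitely have some degree of database corruption. The storage the database lives on (the hard drive or the file system) is very likely damaged. Take a copy immediately and put it on a different hard drive or system.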
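
A minimal sketch of taking that copy, assuming the server is stopped and that the destination is a directory on a different physical drive. The helper name backup_pgdata and the example paths are mine, not anything PostgreSQL ships; it simply archives the whole data directory with tar and prints the archive path.

```shell
#!/bin/sh
# Sketch: archive a stopped PostgreSQL data directory before any repair attempt.
# backup_pgdata is a hypothetical helper name, not part of PostgreSQL.
backup_pgdata() {
    src=$1        # data directory, e.g. /var/lib/pgsql/9.1/data
    destdir=$2    # directory on a DIFFERENT drive, e.g. /mnt/rescue (assumed)

    if [ ! -d "$src" ]; then
        echo "no such data directory: $src" >&2
        return 1
    fi
    # A running postmaster keeps postmaster.pid in the data directory.
    # Copying a live cluster would not give a consistent copy, so refuse.
    if [ -e "$src/postmaster.pid" ]; then
        echo "postmaster.pid present: server may be running (or the file is stale); check first" >&2
        return 1
    fi
    mkdir -p "$destdir" || return 1
    archive="$destdir/pgdata-$(date +%Y%m%d-%H%M%S).tar.gz"
    # -C makes paths in the archive relative to the parent directory;
    # -p preserves file permissions.
    tar -C "$(dirname "$src")" -czpf "$archive" "$(basename "$src")" || return 1
    echo "$archive"
}
```

Run as root it would be called as backup_pgdata /var/lib/pgsql/9.1/data /mnt/rescue, where /mnt/rescue stands in for wherever your other drive is mounted. A plain cp -a of the directory works just as well; the point is that the copy is complete and lives on different hardware.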

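Only once you have made a complete filesystem-level copy of the data directory should you try using pg_resetxlog to clear the damaged transaction logs and start your database. Even if it starts, it is highly likely to be corrupt; you should pg_dump it, then re-initdb and restore the dump into the new instance.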
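
The reset, dump, re-initdb and restore sequence could be sketched roughly as follows. This is an outline under assumptions, not a tested recovery procedure: the data directory is taken from the question, the binary directory /usr/pgsql-9.1/bin and the dump path are guesses you will need to adjust, and I have used pg_dumpall in place of per-database pg_dump so that roles come along too. It only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch of the reset / dump / re-initdb / restore sequence.
# Prints the commands by default; set DRY_RUN=0 to actually execute them.
# All paths below are assumptions; adjust them to your installation.
PGDATA=${PGDATA:-/var/lib/pgsql/9.1/data}     # data directory from the question
PGBIN=${PGBIN:-/usr/pgsql-9.1/bin}            # assumed location of the 9.1 binaries
DUMP=${DUMP:-/mnt/rescue/all-databases.sql}   # hypothetical path on another drive
SERVICE=${SERVICE:-postgresql-9.1}

# run: echo the command in dry-run mode, execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

recover_cluster() {
    # 1. Discard the damaged WAL (only after the filesystem-level copy exists).
    run sudo -u postgres "$PGBIN/pg_resetxlog" -f "$PGDATA" || return 1
    # 2. Start the server and dump everything it can still read.
    run service "$SERVICE" start || return 1
    run sudo -u postgres "$PGBIN/pg_dumpall" -f "$DUMP" || return 1
    # 3. Retire the suspect cluster and create a fresh one.
    run service "$SERVICE" stop || return 1
    run mv "$PGDATA" "$PGDATA.damaged" || return 1
    run sudo -u postgres "$PGBIN/initdb" -D "$PGDATA" || return 1
    # 4. Load the dump into the new instance.
    run service "$SERVICE" start || return 1
    run sudo -u postgres "$PGBIN/psql" -f "$DUMP" postgres
}
```

Calling recover_cluster with the defaults just lists the eight steps so you can review them. The dump file has to be writable by the postgres user, and it is worth inspecting it before trusting the restore.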

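If you still cannot start it after pg_resetxlog, post the updated logs from the start attempt made after the reset. You may need to start Pg in standalone (single-user) mode: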

sudo -u postgres postgres --single -D /var/lib/pgsql/data -P -f i postgres

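If that works and gives you a backend> prompt, try again after replacing the last "postgres" with the name of the database you want to connect to. You should be able to SELECT, COPY data out of tables, and so on.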

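If that does not work, i.e. you cannot start a standalone backend, then it is probably time to restore from backups, since you were wise enough to have them. If anyone else reading this is in the same position and has no backups, contact an experienced PostgreSQL consultant and see whether they can recover data from your database. Be prepared to pay for their time and expertise.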

Your file system may be damaged

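The severity of the damage to your PostgreSQL installation suggests that your whole file system may be damaged. You may want to consider restoring the whole system from backup or reinstalling it.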

I would not trust this file system, fsck or no fsck.

SMART-test your drive

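I also recommend that you run a SMART check on your hard drive using smartctl from smartmontools; assuming the drive is /dev/sda, that is smartctl -d ata -a /dev/sda | less. Look for a failed health test, uncorrectable sectors, a high read error rate, a reallocated sector count above 2 or 3, or a non-zero current_pending_sector. Run smartctl -d ata -t long /dev/sda to perform a non-destructive self-test on your hard drive; it does not interrupt normal operation of the system. When the estimated time has elapsed, run smartctl -d ata -a /dev/sda again and check the self-test log to see whether it passed.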
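
If you would rather not read the attribute table by eye, a small filter along these lines can flag the counters mentioned above. It is only a sketch: check_smart_attrs is a name I made up, it assumes the usual smartctl -A table layout with the raw value in the tenth column, and the attribute names (Reallocated_Sector_Ct, Current_Pending_Sector, Offline_Uncorrectable) are the common ones but vary between drives.

```shell
#!/bin/sh
# Sketch: flag worrying attributes in `smartctl -A` output read from stdin.
# check_smart_attrs is a hypothetical helper; the thresholds follow the
# rules of thumb given in the text (reallocated > 3, pending or
# uncorrectable non-zero).
check_smart_attrs() {
    awk '
        # Attribute rows start with a numeric ID; the raw value is field 10.
        $1 ~ /^[0-9]+$/ && NF >= 10 {
            name = $2; raw = $10 + 0
            if (name == "Reallocated_Sector_Ct" && raw > 3) bad = bad " " name "=" raw
            if (name == "Current_Pending_Sector" && raw > 0) bad = bad " " name "=" raw
            if (name == "Offline_Uncorrectable" && raw > 0) bad = bad " " name "=" raw
        }
        END {
            # Non-zero exit status when anything was flagged.
            if (bad != "") { print "WARN:" bad; exit 1 }
            print "OK"
        }
    '
}
```

It would be used as smartctl -d ata -A /dev/sda | check_smart_attrs. An OK here does not replace the overall health status or the long self-test; it only automates the quick look at the raw counters.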

If anything looks less than perfect, replace the drive.

In future, consider automating this testing with smartd to get early warning of drive failure.

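(Some content in this post has been made obsolete by updates to the question. If you are troubleshooting a similar issue, see this answer's edit history.)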
