Recovering a PostgreSQL database from WAL errors at startup?

Question

Edw*_*ard 5 postgresql ubuntu corruption

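I am trying to set up an OpenStreetMap server on an Ubuntu 12.04 machine using the Ubuntu packages listed on switch2osm.org. I initially installed and set everything up with a map extract covering only the US Northeast, but now I want to install the whole planet. I downloaded planet-latest.osm.bz2 and ran osm2pgsql --slim -C 60000 planet-latest.osm.bz2 as a user with write permission on the database; this is the same command I used earlier to install us-northeast.osm.pbf. When I came back the next day, the command appeared to have completed successfully, but for some reason the rendering daemon was not generating new tiles from the new data. I tried restarting the rendering daemon, and when that had no effect, I tried restarting the database with sudo /etc/init.d/postgresql restart. However, the server failed to start, with the following errors in the log: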

2012-07-13 18:54:59 UTC WARNING:  page 1525147 of relation base/16385/477861 was uninitialized
2012-07-13 18:54:59 UTC WARNING:  page 2247965 of relation base/16385/477861 was uninitialized
...500 more lines like this...
2012-07-13 18:54:59 UTC WARNING:  page 2262926 of relation base/16385/477861 was uninitialized
2012-07-13 18:54:59 UTC PANIC:  WAL contains references to invalid pages
2012-07-13 18:55:00 UTC LOG:  startup process (PID 22826) was terminated by signal 6: Aborted

(A Pastebin of the entire log is here.)
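
With several hundred warnings of this kind, it can help to tally them before reading further. The following is a minimal Python sketch, not part of the original post, that groups the "was uninitialized" warnings by relation path and reports how many there are and which page range they cover. The function name `summarize_uninitialized` is hypothetical, and the sample contains only the three warning lines quoted above. A path such as base/16385/477861 has the form base/<database OID>/<relfilenode>.

```python
import re

# Matches the startup WARNING lines quoted in the log excerpt above.
UNINIT = re.compile(r"WARNING:\s+page (\d+) of relation (\S+) was uninitialized")

def summarize_uninitialized(log_text):
    """Return {relation_path: (warning_count, lowest_page, highest_page)}."""
    pages = {}
    for line in log_text.splitlines():
        m = UNINIT.search(line)
        if m:
            pages.setdefault(m.group(2), []).append(int(m.group(1)))
    return {rel: (len(p), min(p), max(p)) for rel, p in pages.items()}

# Only the three warnings shown in the post; the real log has ~500 more.
sample = """\
2012-07-13 18:54:59 UTC WARNING:  page 1525147 of relation base/16385/477861 was uninitialized
2012-07-13 18:54:59 UTC WARNING:  page 2247965 of relation base/16385/477861 was uninitialized
2012-07-13 18:54:59 UTC WARNING:  page 2262926 of relation base/16385/477861 was uninitialized
2012-07-13 18:54:59 UTC PANIC:  WAL contains references to invalid pages
"""

summary = summarize_uninitialized(sample)
print(summary)
```

If every warning in the full log names the same relation, as the quoted lines do, the damage is probably confined to a single table or index file rather than spread across the database.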

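There is not much information on the Internet about this kind of error, but from what I could find, it seems to mean that either my indexes are corrupted or my write-ahead log is corrupted. However, the only way to repair corrupted indexes is to start the database in single-user mode and rebuild them, and I cannot even do that, because I get the same fatal error even when I start in single-user mode with indexes disabled.

Is there any way for me to delete the write-ahead log and force the server to start "from scratch", or to repair this corruption in a way that does not require successfully starting the database first?

Alternatively, given that I cannot start the server to run a DROP DATABASE command, is there a way for me to delete the database and re-import all of the planet data?

Update:

Following Craig Ringer's suggestion, I looked through the database log from before the WAL errors started, to see whether I could find any suspicious behavior. In the log before the first instance of the WAL error, I found these suspicious lines: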

2012-07-13 00:20:51 UTC LOG:  received fast shutdown request
2012-07-13 00:20:51 UTC LOG:  aborting any active transactions
2012-07-13 00:20:51 UTC FATAL:  terminating connection due to administrator command
2012-07-13 00:20:51 UTC FATAL:  terminating connection due to administrator command
2012-07-13 00:20:51 UTC FATAL:  terminating connection due to administrator command
2012-07-13 00:20:51 UTC FATAL:  terminating connection due to administrator command
2012-07-13 00:20:54 UTC FATAL:  terminating connection due to administrator command
2012-07-13 00:20:54 UTC STATEMENT:  CREATE TABLE planet_osm_polygon_tmp AS SELECT * 
FROM planet_osm_polygon ORDER BY way;

2012-07-13 00:20:55 UTC FATAL:  terminating connection due to administrator command
2012-07-13 00:20:55 UTC STATEMENT:  CREATE INDEX planet_osm_ways_nodes ON planet_osm_ways 
USING gin (nodes)  WITH (FASTUPDATE=OFF);

2012-07-13 00:20:57 UTC FATAL:  terminating connection due to administrator command
2012-07-13 00:20:57 UTC STATEMENT:  CREATE TABLE planet_osm_line_tmp AS SELECT * 
FROM planet_osm_line ORDER BY way;

2012-07-13 00:21:51 UTC LOG:  received immediate shutdown request
2012-07-13 00:21:52 UTC WARNING:  terminating connection because of crash of another
server process
2012-07-13 00:21:52 UTC DETAIL:  The postmaster has commanded this server process 
to roll back the current transaction and exit, because another server process 
exited abnormally and possibly corrupted shared memory.
2012-07-13 00:21:52 UTC HINT:  In a moment you should be able to reconnect to the 
database and repeat your command.
2012-07-13 00:21:52 UTC LOG:  could not send data to client: Broken pipe
2012-07-13 00:21:58 UTC WARNING:  terminating connection because of crash of another 
server process
2012-07-13 00:21:58 UTC DETAIL:  The postmaster has commanded this server process 
to roll back the current transaction and exit, because another server process
exited abnormally and possibly corrupted shared memory.
2012-07-13 00:21:58 UTC HINT:  In a moment you should be able to reconnect to the
 database and repeat your command.
2012-07-13 00:21:58 UTC LOG:  could not send data to client: Broken pipe

(A Pastebin of the entire log is here.)

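When it says "terminating connection due to administrator command", I assume that is my command to restart the database server. But it looks as though the shutdown somehow failed, resulting in corrupted shared memory. This does not make sense, because I restarted it "cleanly" using the /etc/init.d/postgres restart script, rather than abruptly killing it or doing anything manually while logged in as postgres. Am I misinterpreting this log? Or is there actually something wrong with restarting the PostgreSQL server using /etc/init.d/postgres restart?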
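
One detail in the excerpt is easier to see when extracted: there are two shutdown requests, a "fast" one followed by an "immediate" one. The Python sketch below is an illustration rather than part of the original post. It pulls out those requests and prints the time between them. The function name `shutdown_requests` is hypothetical, and the sample reuses lines from the log above.

```python
import re
from datetime import datetime

# Matches lines such as "... UTC LOG:  received fast shutdown request".
SHUTDOWN = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC LOG:\s+received (\w+) shutdown request"
)

def shutdown_requests(log_text):
    """Return a list of (timestamp, mode) for every shutdown request in the log."""
    events = []
    for line in log_text.splitlines():
        m = SHUTDOWN.match(line)
        if m:
            when = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            events.append((when, m.group(2)))
    return events

sample = """\
2012-07-13 00:20:51 UTC LOG:  received fast shutdown request
2012-07-13 00:20:51 UTC LOG:  aborting any active transactions
2012-07-13 00:20:57 UTC FATAL:  terminating connection due to administrator command
2012-07-13 00:21:51 UTC LOG:  received immediate shutdown request
"""

events = shutdown_requests(sample)
# Report the delay between each request and the next one.
for (t1, mode1), (t2, mode2) in zip(events, events[1:]):
    print(f"{mode1} -> {mode2}: {(t2 - t1).total_seconds():.0f} s")
```

On the quoted lines this prints `fast -> immediate: 60 s`. A gap of exactly one minute is consistent with whatever issued the restart escalating after a timeout, although the log alone does not say what sent the second request. The "crash of another server process" warnings begin one second after the immediate shutdown request.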

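(Note that since my question was migrated to Database Administrators, where I am a "new user", I am no longer able to upvote your answers. That does not mean I do not appreciate your help.)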

Answer 1

Cra*_*ger 5

Update: It looks like this is a bug in the Debian/Ubuntu packaging of PostgreSQL, where the init script, very unsafely, will kill -9 the postmaster and remove postmaster.pid. See this post on pgsql-general.

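Personally, I have edited my init scripts to get rid of this rather hairy and dangerous code.

Original answer

Please go back through the logs from before the restart and see whether you can find any errors. WAL corruption should absolutely not happen, so if it has, it is important to investigate why. If you could upload a copy of the whole log to a pastebin or similar, that would be very handy.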

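The only time WAL corruption is an accepted possibility with PostgreSQL is if you are running with fsync=off set in postgresql.conf and your system crashes or loses power unexpectedly. If that is not the cause, it is best to investigate what happened.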
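
Since the server will not start, the fsync setting cannot be queried with SHOW fsync and has to be read from the configuration file instead. The following Python sketch shows one way to do that. It makes several assumptions: the function name `fsync_enabled` is hypothetical, only the full boolean spellings are recognised, and `include` directives are not followed. On Ubuntu the file is typically found under /etc/postgresql/<version>/main/, which is also an assumption about this particular install.

```python
import re

# postgresql.conf lines look like "name = value"; the "=" is optional and the
# value may be wrapped in single quotes.
SETTING = re.compile(r"^\s*fsync\s*(?:=|\s)\s*'?(\w+)'?", re.IGNORECASE)

def fsync_enabled(conf_text):
    """Return True unless the configuration text explicitly turns fsync off."""
    value = "on"  # PostgreSQL's default when the setting is absent or commented out
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0]      # drop comments
        m = SETTING.match(line)
        if m:
            value = m.group(1).lower()    # a later assignment overrides an earlier one
    return value in {"on", "true", "yes", "1"}

# Hypothetical file contents, for illustration only.
sample = """\
#fsync = on                # commented-out default
wal_sync_method = fsync    # a different setting; must not be mistaken for fsync
fsync = off                # hypothetical override
"""

print(fsync_enabled(sample))
```

For the hypothetical sample this prints `False`. To check a real file, pass its contents in, for example `fsync_enabled(open(path).read())`.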

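Please do not use pg_resetxlog without having some idea of why your xlogs are damaged. If the transaction logs are corrupted, something is seriously wrong and you need to find out why. If you band-aid over it now, it may come back to bite you later, when you care about the data.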

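The transaction logs exist for a reason, and simply deleting them will leave your tables and indexes in an inconsistent, damaged state. After a pg_resetxlog, it is a very good idea to pg_dumpall, drop the cluster, re-initialize the database, and reload your databases. But, as I said, this should not happen, and you should look in your logs for clues as to what might have happened.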
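
For reference, the last-resort sequence described above can be written down as an ordered plan. The Python sketch below only prints the commands and executes nothing. Everything specific in it is an assumption about a Debian/Ubuntu install: the version 9.1, the cluster name main, the data directory layout, the dump file name, and the use of the postgresql-common wrappers (pg_ctlcluster, pg_dropcluster, pg_createcluster). It also assumes the binaries are on the PATH and are run as the postgres user. The function name `rebuild_plan` is hypothetical.

```python
import shlex

def rebuild_plan(version="9.1", cluster="main", dump_file="all_databases.sql"):
    """Return the last-resort sequence as argv lists; nothing is executed."""
    # Assumed Debian/Ubuntu data directory layout.
    data_dir = f"/var/lib/postgresql/{version}/{cluster}"
    return [
        ["pg_resetxlog", "-f", data_dir],                   # discard WAL; server must be stopped
        ["pg_ctlcluster", version, cluster, "start"],       # start the (possibly inconsistent) cluster
        ["pg_dumpall", "-f", dump_file],                    # salvage whatever is still readable
        ["pg_dropcluster", "--stop", version, cluster],     # remove the damaged cluster
        ["pg_createcluster", "--start", version, cluster],  # create a fresh cluster (runs initdb)
        ["psql", "-f", dump_file, "postgres"],              # reload the dump
    ]

# Dry run: print each step as a shell-quoted command line.
for argv in rebuild_plan():
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Printing rather than running is deliberate: pg_resetxlog is destructive, and the point of the answer is that it should only be considered once the cause of the corruption is understood.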

Asked:	13 years, 3 months ago
Viewed:	3921 times
Last active:	13 years, 2 months ago
