Degraded ZFS pool vs. faulted

the*_*ng2 6 zfs degraded

My backup NAS (running Arch) reports the pool as degraded. It also reports the degraded disk as "(repairing)". This confuses me. Assuming faulted is worse than degraded, should I be worried?

zpool status -v:

  pool: zdata
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub in progress since Mon Dec 16 11:35:37 2019
        1.80T scanned at 438M/s, 996G issued at 73.7M/s, 2.22T total
        1.21M repaired, 43.86% done, 0 days 04:55:13 to go
config:

        NAME                            STATE     READ WRITE CKSUM
        zdata                           DEGRADED     0     0     0
          wwn-0x50014ee0019b83a6-part1  ONLINE       0     0     0
          wwn-0x50014ee057084591-part1  ONLINE       0     0     0
          wwn-0x50014ee0ac59cb99-part1  DEGRADED   224     0   454  too many errors  (repairing)
          wwn-0x50014ee2b3f6d328-part1  ONLINE       0     0     0
        logs
          wwn-0x50000f0056424431-part5  ONLINE       0     0     0
        cache
          wwn-0x50000f0056424431-part4  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        zdata/backup:<0x86697>

Also, the disk reported as failing shows far less allocated space. zpool iostat -v:

                                  capacity     operations     bandwidth
pool                            alloc   free   read  write   read  write
------------------------------  -----  -----  -----  -----  -----  -----
zdata                           2.22T  1.41T     33     34  31.3M  78.9K
  wwn-0x50014ee0019b83a6-part1   711G   217G     11      8  10.8M  18.0K
  wwn-0x50014ee057084591-part1   711G   217G     10     11  9.73M  24.6K
  wwn-0x50014ee0ac59cb99-part1   103G   825G      0     10      0  29.1K
  wwn-0x50014ee2b3f6d328-part1   744G   184G     11      2  10.7M  4.49K
logs                                -      -      -      -      -      -
  wwn-0x50000f0056424431-part5     4K   112M      0      0      0      0
cache                               -      -      -      -      -      -
  wwn-0x50000f0056424431-part4  94.9M  30.9G      0      1      0   128K
------------------------------  -----  -----  -----  -----  -----  -----

[Edit] Since the disk kept reporting errors, I decided to replace it with a spare drive. First I issued an add-spare command for the new disk, which added it to the pool; then I issued a replace command to swap the degraded disk for the spare. It may not have improved things, because the pool now shows:
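The spare-then-replace sequence described above boils down to two commands. A dry-run sketch (the command strings are echoed rather than executed, and the WWNs are taken from the status output; adjust them to your devices before running for real):

```shell
# Dry-run sketch of adding a hot spare and replacing the degraded disk
# with it. Drop the echo lines and run the $*_CMD strings to execute.
POOL=zdata
FAILING=wwn-0x50014ee0ac59cb99-part1
SPARE=wwn-0x50014ee25ea101ef

ADD_CMD="zpool add $POOL spare $SPARE"
REPLACE_CMD="zpool replace $POOL $FAILING $SPARE"

echo "$ADD_CMD"      # → zpool add zdata spare wwn-0x50014ee25ea101ef
echo "$REPLACE_CMD"  # → zpool replace zdata wwn-0x50014ee0ac59cb99-part1 wwn-0x50014ee25ea101ef
```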

  pool: zdata
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Dec 22 10:20:20 2019
        36.5G scanned at 33.2M/s, 27.4G issued at 24.9M/s, 2.21T total
        0B resilvered, 1.21% done, 1 days 01:35:59 to go
config:

        NAME                              STATE     READ WRITE CKSUM
        zdata                             DEGRADED     0     0     0
          wwn-0x50014ee0019b83a6-part1    ONLINE       0     0     0
          wwn-0x50014ee057084591-part1    ONLINE       0     0     0
          spare-2                         DEGRADED     0     0     0
            wwn-0x50014ee0ac59cb99-part1  DEGRADED     0     0     0  too many errors
            wwn-0x50014ee25ea101ef        ONLINE       0     0     0
          wwn-0x50014ee2b3f6d328-part1    ONLINE       0     0     0
        logs
          wwn-0x50000f0056424431-part5    ONLINE       0     0     0
        cache
          wwn-0x50000f0056424431-part4    ONLINE       0     0     0
        spares
          wwn-0x50014ee25ea101ef          INUSE     currently in use

errors: No known data errors

What worries me is that the "to go" time keeps increasing(!). As I write this, it reads 1 days 05:40:10. I assume that if another disk, the controller, or the power supply fails in the meantime, the pool is lost for good.
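The "to go" figure is essentially the remaining data divided by the current issue rate, so it grows whenever the rate drops; it is not a running average of the whole operation. Plugging the numbers from the resilver status above into that formula reproduces the reported estimate (a rough sketch, units converted to MiB):

```shell
# Recompute the resilver ETA from the status line: 2.21T total, 27.4G
# issued so far, at 24.9M/s. Because the estimate tracks the *current*
# rate, a falling rate pushes the ETA out.
awk 'BEGIN {
    total  = 2.21 * 1024 * 1024   # TiB -> MiB
    issued = 27.4 * 1024          # GiB -> MiB
    rate   = 24.9                 # MiB/s
    printf "%.1f hours remaining\n", (total - issued) / rate / 3600
}'
# → 25.5 hours remaining   (the status reports "1 days 01:35:59 to go")
```

The same arithmetic applied to the earlier scrub line (2.22T total, 996G issued at 73.7M/s) lands on the reported "0 days 04:55:13 to go" as well.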

[Edit] The new drive finished resilvering after about 4 hours; ZFS's estimate was clearly off. After removing the faulted drive, I'm now in a situation where the new drive, a 1 TB disk, shows only 103G in use, just like the degraded drive did. How do I get to the full 1 TB?
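One thing worth noting about the 103G figure: zpool iostat reports allocated versus free space, not raw capacity, and summing the two columns for the "small" disk shows the full disk is already visible to ZFS. A quick check with the numbers from the iostat output above:

```shell
# alloc + free = the vdev size ZFS actually sees (values in GiB, taken
# from the zpool iostat output for the degraded/replacement disk).
awk 'BEGIN {
    alloc = 103; free = 825
    printf "capacity seen by ZFS: %dG\n", alloc + free
}'
# → capacity seen by ZFS: 928G
```

928G is roughly what a marketing "1 TB" (decimal) disk comes out to in GiB after partitioning overhead, so nothing is missing; the low alloc just means ZFS has not yet written much data to this nearly empty vdev, and it will fill in as new writes favor the emptier device.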

sho*_*hok 7

In general, DEGRADED is a better state for a disk to be in than FAULTED.

From the zpool man page (slightly reformatted):

DEGRADED: the number of checksum errors exceeded acceptable levels and the device is degraded as an indication that something may be wrong. ZFS continues to use the device as necessary.

FAULTED: the number of I/O errors exceeded acceptable levels and the device is faulted to prevent further use of the device.

In your specific case, the scrub found many read and checksum errors on one disk, and ZFS began repairing the affected disk. Meanwhile ZED (the ZFS Event Daemon) noticed the burst of checksum errors and degraded the disk to avoid using/stressing it further.

When the scrub finishes, I suggest you zpool clear the pool and run another scrub. If the second scrub finds no errors, you can continue using the pool; however, given how many errors surfaced during the current scrub, I would replace the disk as soon as possible.
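That flow is just `zpool clear zdata` followed by `zpool scrub zdata`; after the second scrub, any vdev whose READ/WRITE/CKSUM counters are nonzero still has problems. A quick way to spot those is to scan the status output (a sketch; here a captured excerpt of the `zpool status` shown above stands in for the live command output):

```shell
# Flag vdevs with nonzero READ ($3), WRITE ($4) or CKSUM ($5) counters.
# On a live system, pipe in:  zpool status zdata
status='          wwn-0x50014ee057084591-part1  ONLINE       0     0     0
          wwn-0x50014ee0ac59cb99-part1  DEGRADED   224     0   454'

echo "$status" | awk '$3 + $4 + $5 > 0 { print $1, "still shows errors" }'
# → wwn-0x50014ee0ac59cb99-part1 still shows errors
```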

If you have good reason to believe the disk itself is not at fault, you should analyze the dmesg and smartctl --all output to find the root cause of the errors. As an example: I once had a disk that was fine in itself but produced many real errors due to electrical noise from the power supply/cabling.
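On the smartctl side, the attributes worth watching are ones like Reallocated_Sector_Ct and Current_Pending_Sector: a nonzero raw value (last column) points at the drive itself, while clean SMART attributes alongside kernel I/O errors in dmesg suggest cabling or power. A small sketch, run against a sample attribute line (the value 8 here is hypothetical; on the real host pipe in `smartctl --all /dev/sdX`):

```shell
# In the SMART attribute table, column 2 is the attribute name and
# column 10 the raw value. Nonzero pending/reallocated sectors mean
# the drive itself is degrading.
smart='197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       8'

echo "$smart" | awk '$10 > 0 { print $2, "raw value:", $10 }'
# → Current_Pending_Sector raw value: 8
```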

In any case, the golden rule always applies: be sure to keep an up-to-date backup of your pool's data.