ZFS disk in two pools after a reboot?

Sam*_*tin 2 linux zfs zpool

I'm pretty new to ZFS, so maybe I'm misreading this.
After rebooting my server I had to re-import my pool, and doing so presented the following status:

# zpool status -v
  pool: data
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub in progress since Sat Mar 11 09:54:23 2017
    18.4M scanned out of 28.9T at 4.61M/s, (scan is slow, no estimated time)
    0 repaired, 0.00% done
config:

        NAME                     STATE     READ WRITE CKSUM
        data                     DEGRADED     0     0     0
          raidz1-0               DEGRADED     0     0     0
            sdd                  ONLINE       0     0     0
            5824186382562856058  FAULTED      0     0     0  was /dev/sdb1
            sde                  ONLINE       0     0     0
          raidz1-1               ONLINE       0     0     0
            sdj                  ONLINE       0     0     0
            sdk                  ONLINE       0     0     0
            sdl                  ONLINE       0     0     0
          raidz1-2               ONLINE       0     0     0
            sdg                  ONLINE       0     0     0
            sdb                  ONLINE       0     0     0
            sdf                  ONLINE       0     0     0
          raidz1-3               ONLINE       0     0     0
            sdc                  ONLINE       0     0     0
            sdh                  ONLINE       0     0     0
            sdi                  ONLINE       0     0     0

What caught my eye is the FAULTED device in raidz1-0, which is noted as having been /dev/sdb1 — yet /dev/sdb is currently in use by raidz1-2.

So I duly exported the pool, force-cleared the label on /dev/sdb, and was presented with the following status (a sketch of the likely commands follows the output):

# zpool status
  pool: data
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Mar 11 10:01:16 2017
    6.16G scanned out of 28.9T at 263M/s, 32h3m to go
    2.02G resilvered, 0.02% done
config:

        NAME                       STATE     READ WRITE CKSUM
        data                       ONLINE       0     0     0
          raidz1-0                 ONLINE       0     0     0
            sdd                    ONLINE       0     0     0
            replacing-1            UNAVAIL      0     0     0
              5824186382562856058  UNAVAIL      0     0     0  was /dev/sdb1/old
              sdb1                 ONLINE       0     0     0  (resilvering)
            sde                    ONLINE       0     0     0
          raidz1-1                 ONLINE       0     0     0
            sdj                    ONLINE       0     0     0
            sdk                    ONLINE       0     0     0
            sdl                    ONLINE       0     0     0
          raidz1-2                 ONLINE       0     0     0
            sdg                    ONLINE       0     0     0
            16211591403717513484   UNAVAIL      0     0     0  was /dev/sdb1
            sdf                    ONLINE       0     0     0
          raidz1-3                 ONLINE       0     0     0
            sdc                    ONLINE       0     0     0
            sdh                    ONLINE       0     0     0
            sdi                    ONLINE       0     0     0
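Roughly, that sequence corresponds to commands like these (a reconstruction, not the exact history; the final replace is inferred from the replacing-1 entry in the status above):

zpool export data                                # release the pool
zpool labelclear -f /dev/sdb1                    # force-clear the stale ZFS label
zpool import data                                # reassemble from on-disk metadata
zpool replace data 5824186382562856058 /dev/sdb  # inferred from the 'replacing-1' entry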

I have two questions:

  1. How did this happen?
  2. Presumably this means I've lost a disk somewhere? What's the best way to identify it?

Additional note: this server should have 12 data disks plus the boot disk, but blkid lists only 11 ZFS members (see the cross-check sketch after the listing).

# blkid
/dev/sda1: UUID="43AB-B900" TYPE="vfat" PARTUUID="70dbeb11-8d0f-4a90-892b-71ddbfa40614"
/dev/sda2: UUID="31b78e1e-47d2-4835-84f3-52526382626e" TYPE="ext2" PARTUUID="d4385b72-1d3b-4f10-b7be-a47240d0a875"
/dev/sda3: UUID="BW2exB-GVBK-2kYB-O6I3-0Xff-tZsT-1wR3eT" TYPE="LVM2_member" PARTUUID="8298e710-3c27-45f8-bde2-0ca014f61560"
/dev/sdc1: LABEL="data" UUID="1497224562158568852" UUID_SUB="8549439230979948204" TYPE="zfs_member" PARTLABEL="zfs-d160a62f672223cd" PARTUUID="9a5815bb-0c8c-4147-81f7-3c2ed819c856"
/dev/sdd1: LABEL="data" UUID="1497224562158568852" UUID_SUB="8670871889276024405" TYPE="zfs_member" PARTLABEL="zfs-056f7c2c0a7e1d0a" PARTUUID="672f59c7-b6b3-604b-8afd-594bd3b9b5f8"
/dev/sde1: LABEL="data" UUID="1497224562158568852" UUID_SUB="6213246766257863816" TYPE="zfs_member" PARTLABEL="zfs-65908045daba9599" PARTUUID="04785f97-1125-7642-b5c1-9c1a16cda925"
/dev/sdf1: LABEL="data" UUID="1497224562158568852" UUID_SUB="8276492610986556289" TYPE="zfs_member" PARTLABEL="zfs-f8318dd36075cff4" PARTUUID="5d7feebf-8a5f-654b-b2d1-c15691800f44"
/dev/sdh1: LABEL="data" UUID="1497224562158568852" UUID_SUB="1281571628149249275" TYPE="zfs_member" PARTLABEL="zfs-59cc747b1125d66a" PARTUUID="61c60d91-9a85-3b4d-9b99-8df071434a50"
/dev/sdg1: LABEL="data" UUID="1497224562158568852" UUID_SUB="10881622467137806147" TYPE="zfs_member" PARTLABEL="zfs-1a80f12f1e668bbe" PARTUUID="208107b9-ad5f-184c-9178-5db0ebf19a14"
/dev/sdi1: LABEL="data" UUID="1497224562158568852" UUID_SUB="17007441084616526595" TYPE="zfs_member" PARTLABEL="zfs-0a8e6dabd469faca" PARTUUID="e8ed04a8-cde2-6244-902e-6353664af06a"
/dev/sdj1: LABEL="data" UUID="1497224562158568852" UUID_SUB="8620535390437895467" TYPE="zfs_member" PARTLABEL="zfs-97d91e998134d363" PARTUUID="a689a2ff-3b07-ef41-8b9c-cf6361a0e1d1"
/dev/sdk1: LABEL="data" UUID="1497224562158568852" UUID_SUB="17779182602415489900" TYPE="zfs_member" PARTLABEL="zfs-52c3d94733668a22" PARTUUID="42a8072a-e94c-a64d-aa07-dee30f675655"
/dev/sdl1: LABEL="data" UUID="1497224562158568852" UUID_SUB="7227713853040895948" TYPE="zfs_member" PARTLABEL="zfs-cc1406096601d13c" PARTUUID="5481683e-1d8b-4342-9629-3c49f6397075"
/dev/mapper/server--vg-root: UUID="1e3fee5d-d4c8-4971-ae32-23722bbd0688" TYPE="ext4"
/dev/mapper/server--vg-swap_1: UUID="6447b120-e79d-4c9f-8cc6-8eef5e275dfc" TYPE="swap"
/dev/sdb1: LABEL="data" UUID="1497224562158568852" UUID_SUB="16704776748125199400" TYPE="zfs_member" PARTLABEL="zfs-368140a1f4980990" PARTUUID="c131befd-a122-aa45-b710-399233eb08a6"
/dev/sdb9: PARTUUID="4aaed8f3-443e-2e44-8737-94a4d09496aa"
/dev/sdc9: PARTUUID="5f2cb2dd-dddd-154f-a771-8db4f5475fec"
/dev/sdd9: PARTUUID="22968880-24bb-d94a-a50f-13adaaa380bc"
/dev/sde9: PARTUUID="b867fa3f-bda4-cf40-b44c-c76bad4047be"
/dev/sdf9: PARTUUID="b4f79585-6676-de40-81ea-44cf74937b28"
/dev/sdh9: PARTUUID="77f5225f-e0e5-4d4f-8361-d13984807960"
/dev/sdg9: PARTUUID="be9746bc-1eb5-9342-b753-3471ae936d42"
/dev/sdi9: PARTUUID="08675893-d6d3-0b49-bf69-105383040006"
/dev/sdj9: PARTUUID="107df9dc-7ea8-694a-8deb-7a6025b74b86"
/dev/sdk9: PARTUUID="2b2ef8de-da71-a740-aad0-bd2dc1d1c8a7"
/dev/sdl9: PARTUUID="f52efda2-f758-2f47-80ff-318be5db3fca"
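A quick way to cross-check which physical disk is missing is to compare what the kernel enumerates against the ZFS members blkid reports, using standard util-linux tools (a sketch):

lsblk -d -o NAME,SIZE,SERIAL         # one row per physical disk the kernel sees
ls /dev/disk/by-id/ | grep -v part   # stable, serial-based device names, one per disk
blkid | grep -c zfs_member           # prints 11 here, one short of the expected 12

Comparing the serials under /dev/disk/by-id against the hardware inventory points at the drive that no longer enumerates.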

Tho*_*mas 5

Well, you lost one of the disks belonging to raidz1-0. As Michael Hampton already mentioned, the /dev/sd[a-m] devices were renamed after the reboot.

ZFS is clever enough not to rely on the /dev/sdx names and can put the pool together from the metadata on the disks themselves. At that point raidz1-0 was degraded by a faulted drive, which had been /dev/sdb before you rebooted the server. After the reboot, the missing disk shifted the device names, and the disk belonging to raidz1-2 became /dev/sdb. Because ZFS is smart enough, it didn't care and simply assembled the pool correctly.
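Incidentally, this whole class of confusion can be avoided by importing the pool with stable device IDs instead of sdX names, which is standard practice with ZFS on Linux:

zpool export data
zpool import -d /dev/disk/by-id data   # vdevs now show as serial-based IDs, immune to renames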
At that point, you should have replaced the faulted disk and resilvered raidz1-0.
Instead, you degraded the second vdev, raidz1-2, by taking the healthy disk /dev/sdb — which really belongs to raidz1-2 — and adding it to raidz1-0, which triggered the resilver you now see.

You should now replace the faulted disk and start a resilver of raidz1-2. Chances are good that the disks will be renamed yet again after a reboot.
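Assuming the replacement drive appears under /dev/disk/by-id (the ID below is a placeholder for the new drive's actual name), the replace would look roughly like this — the GUID is the UNAVAIL member of raidz1-2 from your status output:

zpool replace data 16211591403717513484 /dev/disk/by-id/ata-MODEL_SERIAL
zpool status data                      # watch the resilver progress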

To identify the faulted disk, run dd against every disk (or volume) and, standing in front of the server, watch for the drive whose LED does not flash. Don't forget to dd the root-partition disk as well.
Some hardware vendors have tools, or otherwise more elegant methods, for identifying which slot the faulted disk sits in.
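A minimal sketch of the dd approach (the device glob and read size are assumptions): sequentially read each disk so its activity LED lights up; the faulted drive is the one whose LED stays dark, or that no longer appears in /dev at all.

for d in /dev/sd[a-l]; do
    echo "exercising $d"
    dd if="$d" of=/dev/null bs=1M count=4096 status=none   # ~4 GiB sequential read
done

On backplanes with SES/SGPIO support, ledctl locate=/dev/sdX from the ledmon package is one example of the more elegant, vendor-style option alluded to above.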