Second drive in RAID 1 keeps failing

Saf*_*ado 0 linux raid hard-drive mdadm software-raid

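I have a bit of a problem here. I have an Ubuntu Linux server with 2 SAS drives set up in software RAID 1 (created with mdadm). The RAID runs fine for about a day; I can run cat /proc/mdstat and it shows both disks active and everything healthy. Then, out of nowhere, the second disk fails and the array goes into degraded mode.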

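I then remove the disk from the array, reboot the server, and re-add the disk to the array. The RAID rebuilds itself without any problems, and I have a healthy RAID 1 again, running on the same disks. Then, within roughly 12-24 hours, the second drive fails again.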

The hard drives are brand new, so I assume the hardware is fine. Here is the output I was able to capture from kern.log and syslog when the disk failed.

Can anyone interpret this, or does anyone know what might be going on?

Thanks!

Kernel log

Feb 28 20:34:55 CSTEP-APPS20 kernel: [    9.180815] sd 2:0:0:0: Attached scsi generic sg1 type 0
Feb 28 20:34:55 CSTEP-APPS20 kernel: [    9.181086] sd 2:0:1:0: Attached scsi generic sg2 type 0
Feb 28 20:34:55 CSTEP-APPS20 kernel: [    9.181376] sd 2:0:1:0: [sdb] 71096640 512-byte logical blocks: (36.4 GB/33.9 GiB)
Feb 28 20:34:55 CSTEP-APPS20 kernel: [    9.182584] sd 2:0:1:0: [sdb] Write Protect is off
Feb 28 20:34:55 CSTEP-APPS20 kernel: [    9.182591] sd 2:0:1:0: [sdb] Mode Sense: cb 00 10 08
Feb 28 20:34:55 CSTEP-APPS20 kernel: [    9.182835] sd 2:0:0:0: [sda] 71096640 512-byte logical blocks: (36.4 GB/33.9 GiB)
Feb 28 20:34:55 CSTEP-APPS20 kernel: [    9.183802] sd 2:0:1:0: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
Feb 28 20:34:55 CSTEP-APPS20 kernel: [    9.185146] sd 2:0:0:0: [sda] Write Protect is off
Feb 28 20:34:55 CSTEP-APPS20 kernel: [    9.185151] sd 2:0:0:0: [sda] Mode Sense: cb 00 10 08
Feb 28 20:34:55 CSTEP-APPS20 kernel: [    9.188191] sd 2:0:0:0: [sda] Write cache: disabled, read cache: enabled, supports DPO and FUA
Feb 28 20:34:55 CSTEP-APPS20 kernel: [    9.191403] sd 2:0:1:0: [sdb] Attached SCSI disk
Feb 28 20:34:55 CSTEP-APPS20 kernel: [    9.299351] sd 2:0:0:0: [sda] Attached SCSI disk
Mar  1 09:01:22 CSTEP-APPS20 kernel: [44807.010040] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar  1 09:01:32 CSTEP-APPS20 kernel: [44817.560056] sd 2:0:1:0: [sdb] CDB: Test Unit Ready: 00 00 00 00 00 00
Mar  1 09:02:03 CSTEP-APPS20 kernel: [44848.470035] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 c0 00 00 80 00
Mar  1 09:02:03 CSTEP-APPS20 kernel: [44848.720124] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar  1 09:02:04 CSTEP-APPS20 kernel: [44849.512078] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380147] sd 2:0:1:0: Device offlined - not ready after error recovery
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380153] sd 2:0:1:0: Device offlined - not ready after error recovery
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380167] sd 2:0:1:0: rejecting I/O to offline device
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380285] sd 2:0:1:0: rejecting I/O to offline device
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380403] sd 2:0:1:0: [sdb] Unhandled error code
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380407] sd 2:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380416] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380677] sd 2:0:1:0: [sdb] Unhandled error code
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380680] sd 2:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380684] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 c0 00 00 80 00
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380915] sd 2:0:1:0: rejecting I/O to offline device

And syslog

Mar  1 09:01:43 CSTEP-APPS20 kernel: [44827.860060] mptscsih: ioc0: WARNING - Issuing Reset from mptscsih_IssueTaskMgmt!!
Mar  1 09:01:43 CSTEP-APPS20 kernel: [44827.860070] mptbase: ioc0: Initiating recovery
Mar  1 09:02:03 CSTEP-APPS20 kernel: [44848.470023] mptscsih: ioc0: task abort: SUCCESS (sc=ffff88016197b400)
Mar  1 09:02:03 CSTEP-APPS20 kernel: [44848.470030] mptscsih: ioc0: attempting task abort! (sc=ffff880156fa4c00)
Mar  1 09:02:03 CSTEP-APPS20 kernel: [44848.470035] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 c0 00 00 80 00
Mar  1 09:02:03 CSTEP-APPS20 kernel: [44848.470050] mptscsih: ioc0: task abort: SUCCESS (sc=ffff880156fa4c00)
Mar  1 09:02:03 CSTEP-APPS20 kernel: [44848.470073] scsi target2:0:0: Beginning Domain Validation
Mar  1 09:02:03 CSTEP-APPS20 kernel: [44848.720120] mptscsih: ioc0: attempting target reset! (sc=ffff88016197b400)
Mar  1 09:02:03 CSTEP-APPS20 kernel: [44848.720124] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar  1 09:02:04 CSTEP-APPS20 kernel: [44849.262008] mptscsih: ioc0: target reset: SUCCESS (sc=ffff88016197b400)
Mar  1 09:02:04 CSTEP-APPS20 kernel: [44849.512073] mptscsih: ioc0: attempting bus reset! (sc=ffff88016197b400)
Mar  1 09:02:04 CSTEP-APPS20 kernel: [44849.512078] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar  1 09:02:05 CSTEP-APPS20 kernel: [44850.046491] mptscsih: ioc0: bus reset: SUCCESS (sc=ffff88016197b400)
Mar  1 09:02:15 CSTEP-APPS20 kernel: [44860.553909] mptscsih: ioc0: attempting host reset! (sc=ffff88016197b400)
Mar  1 09:02:15 CSTEP-APPS20 kernel: [44860.553915] mptbase: ioc0: Initiating recovery
Mar  1 09:02:35 CSTEP-APPS20 kernel: [44879.870026] mptscsih: ioc0: host reset: SUCCESS (sc=ffff88016197b400)
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380147] sd 2:0:1:0: Device offlined - not ready after error recovery
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380153] sd 2:0:1:0: Device offlined - not ready after error recovery
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380167] sd 2:0:1:0: rejecting I/O to offline device
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380285] sd 2:0:1:0: rejecting I/O to offline device
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380403] sd 2:0:1:0: [sdb] Unhandled error code
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380407] sd 2:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380416] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380429] end_request: I/O error, dev sdb, sector 55297928
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380562] __ratelimit: 24 callbacks suppressed
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380566] raid1: sdb1: rescheduling sector 55295880
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380677] sd 2:0:1:0: [sdb] Unhandled error code
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380680] sd 2:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380684] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 c0 00 00 80 00
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380695] end_request: I/O error, dev sdb, sector 55297984
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380817] raid1: sdb1: rescheduling sector 55295936
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380915] sd 2:0:1:0: rejecting I/O to offline device
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.381019] end_request: I/O error, dev sdb, sector 63983488
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.381142] md: super_written gets error=-5, uptodate=0
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.381146] raid1: Disk failure on sdb1, disabling device.
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.381148] raid1: Operation continuing on 1 devices.
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.398144] scsi target2:0:0: Ending Domain Validation
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.398226] scsi target2:0:0: FAST-160 WIDE SCSI 320.0 MB/s DT IU RTI WRFLOW PCOMP (6.25 ns, offset 127)
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.398295] scsi target2:0:1: Beginning Domain Validation
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.648493] scsi target2:0:1: Domain Validation Initial Inquiry Failed
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.648623] scsi target2:0:1: Ending Domain Validation
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.648691] scsi target2:0:1: asynchronous
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.648760] scsi target2:0:8: Beginning Domain Validation
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.649386] scsi target2:0:8: Ending Domain Validation
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.649458] scsi target2:0:8: asynchronous
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.653384] RAID1 conf printout:
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.653390]  --- wd:1 rd:2
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.653395]  disk 0, wo:0, o:1, dev:sda1
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.653399]  disk 1, wo:1, o:0, dev:sdb1
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.693763] RAID1 conf printout:
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.693767]  --- wd:1 rd:2
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.693771]  disk 0, wo:0, o:1, dev:sda1
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.714266] raid1: sda1: redirecting sector 55295880 to another mirror
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.719943] raid1: sda1: redirecting sector 55295936 to another mirror

Eva*_*son 5

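It looks like the device /dev/sdb is going offline. You might have a cabling problem, but it is just as likely to be a problem with the disk itself. It could, of course, also be a conflict between the disk firmware and the controller.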

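I would run the manufacturer's diagnostics on the disk right away. Being brand new does not rule out a defect. (In fact, because they are brand new, I would suspect them a little more than disks that have already been running for a few months.)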
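
To help read the logs in the question: the `CDB: Read(10)` lines carry the block address of the read that timed out. Below is a minimal Python sketch, assuming the standard 10-byte READ(10) layout (opcode 0x28, big-endian LBA in bytes 2-5, transfer length in bytes 7-8). The function name `decode_read10` is made up for illustration; the two input lines are copied from the syslog excerpt.

```python
import re

# Two lines copied verbatim from the syslog excerpt in the question.
LOG_LINES = [
    "sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00",
    "sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 c0 00 00 80 00",
]

# Capture the ten hex bytes that follow "CDB: Read(10):".
CDB_RE = re.compile(r"CDB: Read\(10\): ((?:[0-9a-f]{2} ?){10})")


def decode_read10(line):
    """Return (lba, block_count) for a Read(10) CDB log line, or None.

    Hypothetical helper: assumes the standard READ(10) layout.
    """
    match = CDB_RE.search(line)
    if match is None:
        return None
    cdb = bytes.fromhex(match.group(1))
    if cdb[0] != 0x28:  # 0x28 is the READ(10) opcode
        return None
    lba = int.from_bytes(cdb[2:6], "big")     # logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # transfer length in blocks
    return lba, blocks


if __name__ == "__main__":
    for line in LOG_LINES:
        lba, blocks = decode_read10(line)
        print(f"LBA {lba}, {blocks} blocks")
```

The decoded addresses, 55297928 and 55297984, match the `end_request: I/O error, dev sdb, sector ...` lines in the syslog, so those errors are the same two reads failing after the controller's abort and reset attempts. They sit exactly 2048 sectors above the sectors raid1 reschedules (55295880 and 55295936), which would be consistent with sdb1 starting at sector 2048, although the post does not show the partition table. Both reads are later redirected to sda1 successfully, so the log by itself does not tell you whether the disk, the cable, or the controller is at fault.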

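  • +1 Even running badblocks against the drive may reveal the fault, and you should be able to run badblocks while the server is online. (2 upvotes)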