Saf*_*ado 0 linux raid hard-drive mdadm software-raid
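I have a bit of a problem here. I have an Ubuntu Linux server with 2 SAS drives set up in a software RAID 1 (created with mdadm). The RAID runs fine for about a day: I can run cat /proc/mdstat and it shows both disks active and everything healthy. Then, out of the blue, the second disk fails and the array drops into degraded mode.
I then remove the disk from the RAID set, reboot the server, and re-add the disk to the set. The RAID rebuilds itself without any problems, and I have a healthy RAID 1 again, running on the same disks. Then, within 12-24 hours or so, the second drive fails again.
The hard drives are brand new, so I assume the hardware is fine. Below is the output I was able to capture from kern.log and syslog when the disk failed.
Can anyone interpret this, or does anyone know what might be going on?
Thanks!
kern.log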
Feb 28 20:34:55 CSTEP-APPS20 kernel: [ 9.180815] sd 2:0:0:0: Attached scsi generic sg1 type 0
Feb 28 20:34:55 CSTEP-APPS20 kernel: [ 9.181086] sd 2:0:1:0: Attached scsi generic sg2 type 0
Feb 28 20:34:55 CSTEP-APPS20 kernel: [ 9.181376] sd 2:0:1:0: [sdb] 71096640 512-byte logical blocks: (36.4 GB/33.9 GiB)
Feb 28 20:34:55 CSTEP-APPS20 kernel: [ 9.182584] sd 2:0:1:0: [sdb] Write Protect is off
Feb 28 20:34:55 CSTEP-APPS20 kernel: [ 9.182591] sd 2:0:1:0: [sdb] Mode Sense: cb 00 10 08
Feb 28 20:34:55 CSTEP-APPS20 kernel: [ 9.182835] sd 2:0:0:0: [sda] 71096640 512-byte logical blocks: (36.4 GB/33.9 GiB)
Feb 28 20:34:55 CSTEP-APPS20 kernel: [ 9.183802] sd 2:0:1:0: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
Feb 28 20:34:55 CSTEP-APPS20 kernel: [ 9.185146] sd 2:0:0:0: [sda] Write Protect is off
Feb 28 20:34:55 CSTEP-APPS20 kernel: [ 9.185151] sd 2:0:0:0: [sda] Mode Sense: cb 00 10 08
Feb 28 20:34:55 CSTEP-APPS20 kernel: [ 9.188191] sd 2:0:0:0: [sda] Write cache: disabled, read cache: enabled, supports DPO and FUA
Feb 28 20:34:55 CSTEP-APPS20 kernel: [ 9.191403] sd 2:0:1:0: [sdb] Attached SCSI disk
Feb 28 20:34:55 CSTEP-APPS20 kernel: [ 9.299351] sd 2:0:0:0: [sda] Attached SCSI disk
Mar 1 09:01:22 CSTEP-APPS20 kernel: [44807.010040] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar 1 09:01:32 CSTEP-APPS20 kernel: [44817.560056] sd 2:0:1:0: [sdb] CDB: Test Unit Ready: 00 00 00 00 00 00
Mar 1 09:02:03 CSTEP-APPS20 kernel: [44848.470035] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 c0 00 00 80 00
Mar 1 09:02:03 CSTEP-APPS20 kernel: [44848.720124] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar 1 09:02:04 CSTEP-APPS20 kernel: [44849.512078] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380147] sd 2:0:1:0: Device offlined - not ready after error recovery
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380153] sd 2:0:1:0: Device offlined - not ready after error recovery
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380167] sd 2:0:1:0: rejecting I/O to offline device
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380285] sd 2:0:1:0: rejecting I/O to offline device
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380403] sd 2:0:1:0: [sdb] Unhandled error code
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380407] sd 2:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380416] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380677] sd 2:0:1:0: [sdb] Unhandled error code
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380680] sd 2:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380684] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 c0 00 00 80 00
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380915] sd 2:0:1:0: rejecting I/O to offline device
And syslog
Mar 1 09:01:43 CSTEP-APPS20 kernel: [44827.860060] mptscsih: ioc0: WARNING - Issuing Reset from mptscsih_IssueTaskMgmt!!
Mar 1 09:01:43 CSTEP-APPS20 kernel: [44827.860070] mptbase: ioc0: Initiating recovery
Mar 1 09:02:03 CSTEP-APPS20 kernel: [44848.470023] mptscsih: ioc0: task abort: SUCCESS (sc=ffff88016197b400)
Mar 1 09:02:03 CSTEP-APPS20 kernel: [44848.470030] mptscsih: ioc0: attempting task abort! (sc=ffff880156fa4c00)
Mar 1 09:02:03 CSTEP-APPS20 kernel: [44848.470035] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 c0 00 00 80 00
Mar 1 09:02:03 CSTEP-APPS20 kernel: [44848.470050] mptscsih: ioc0: task abort: SUCCESS (sc=ffff880156fa4c00)
Mar 1 09:02:03 CSTEP-APPS20 kernel: [44848.470073] scsi target2:0:0: Beginning Domain Validation
Mar 1 09:02:03 CSTEP-APPS20 kernel: [44848.720120] mptscsih: ioc0: attempting target reset! (sc=ffff88016197b400)
Mar 1 09:02:03 CSTEP-APPS20 kernel: [44848.720124] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar 1 09:02:04 CSTEP-APPS20 kernel: [44849.262008] mptscsih: ioc0: target reset: SUCCESS (sc=ffff88016197b400)
Mar 1 09:02:04 CSTEP-APPS20 kernel: [44849.512073] mptscsih: ioc0: attempting bus reset! (sc=ffff88016197b400)
Mar 1 09:02:04 CSTEP-APPS20 kernel: [44849.512078] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar 1 09:02:05 CSTEP-APPS20 kernel: [44850.046491] mptscsih: ioc0: bus reset: SUCCESS (sc=ffff88016197b400)
Mar 1 09:02:15 CSTEP-APPS20 kernel: [44860.553909] mptscsih: ioc0: attempting host reset! (sc=ffff88016197b400)
Mar 1 09:02:15 CSTEP-APPS20 kernel: [44860.553915] mptbase: ioc0: Initiating recovery
Mar 1 09:02:35 CSTEP-APPS20 kernel: [44879.870026] mptscsih: ioc0: host reset: SUCCESS (sc=ffff88016197b400)
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380147] sd 2:0:1:0: Device offlined - not ready after error recovery
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380153] sd 2:0:1:0: Device offlined - not ready after error recovery
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380167] sd 2:0:1:0: rejecting I/O to offline device
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380285] sd 2:0:1:0: rejecting I/O to offline device
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380403] sd 2:0:1:0: [sdb] Unhandled error code
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380407] sd 2:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380416] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380429] end_request: I/O error, dev sdb, sector 55297928
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380562] __ratelimit: 24 callbacks suppressed
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380566] raid1: sdb1: rescheduling sector 55295880
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380677] sd 2:0:1:0: [sdb] Unhandled error code
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380680] sd 2:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380684] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 c0 00 00 80 00
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380695] end_request: I/O error, dev sdb, sector 55297984
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380817] raid1: sdb1: rescheduling sector 55295936
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380915] sd 2:0:1:0: rejecting I/O to offline device
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.381019] end_request: I/O error, dev sdb, sector 63983488
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.381142] md: super_written gets error=-5, uptodate=0
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.381146] raid1: Disk failure on sdb1, disabling device.
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.381148] raid1: Operation continuing on 1 devices.
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.398144] scsi target2:0:0: Ending Domain Validation
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.398226] scsi target2:0:0: FAST-160 WIDE SCSI 320.0 MB/s DT IU RTI WRFLOW PCOMP (6.25 ns, offset 127)
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.398295] scsi target2:0:1: Beginning Domain Validation
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.648493] scsi target2:0:1: Domain Validation Initial Inquiry Failed
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.648623] scsi target2:0:1: Ending Domain Validation
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.648691] scsi target2:0:1: asynchronous
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.648760] scsi target2:0:8: Beginning Domain Validation
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.649386] scsi target2:0:8: Ending Domain Validation
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.649458] scsi target2:0:8: asynchronous
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.653384] RAID1 conf printout:
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.653390] --- wd:1 rd:2
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.653395] disk 0, wo:0, o:1, dev:sda1
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.653399] disk 1, wo:1, o:0, dev:sdb1
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.693763] RAID1 conf printout:
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.693767] --- wd:1 rd:2
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.693771] disk 0, wo:0, o:1, dev:sda1
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.714266] raid1: sda1: redirecting sector 55295880 to another mirror
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.719943] raid1: sda1: redirecting sector 55295936 to another mirror
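It looks like the device /dev/sdb is going offline. You may have a cabling problem, but it could just as well be a problem with the disk itself. A conflict between the disk firmware and the controller is also possible, of course.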
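To tie the log lines together, here is a small sketch that decodes the two Read(10) CDBs from the log. The function name `decode_read10` and the constant `PARTITION_START` are my own, and the 2048-sector partition offset is only an inference from the difference between the sdb and sdb1 sector numbers in the log, not something the poster stated.

```python
def decode_read10(cdb_hex):
    """Return (lba, block_count) for a SCSI READ(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")    # bytes 2-5: logical block address
    count = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return lba, count


# Assumption: sdb1 starts at sector 2048 (inferred from 55297928 - 55295880).
PARTITION_START = 2048

for cdb in ("28 00 03 4b c7 88 00 00 10 00", "28 00 03 4b c7 c0 00 00 80 00"):
    lba, count = decode_read10(cdb)
    print(lba, count, lba - PARTITION_START)
# 55297928 16 55295880
# 55297984 128 55295936
```

The decoded addresses match the sectors in the "end_request: I/O error, dev sdb" lines, and after subtracting the assumed offset they match the "raid1: sdb1: rescheduling sector" lines. So the same two reads timed out, survived the task abort, target reset, bus reset and host reset, and were finally failed with DID_NO_CONNECT. That pattern points at the drive not responding at all rather than at bad data in one spot.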
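I would run the manufacturer's diagnostics on the disks right away. The fact that they are brand new is no reason to rule out a defect. (In fact, being brand new, I would suspect them a little more than disks that have already been running for a few months.)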
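While the diagnostics run, it may help to catch the exact moment the array degrades so you can line it up with the kernel messages. A minimal sketch, assuming the usual /proc/mdstat layout of an array header line followed by a `[n/m] [U_]` status line; the sample text, the array name md0 and the block count are made up for illustration and are not from the poster's server.

```python
import re

# Illustrative sample only; not captured from the poster's machine.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1](F) sda1[0]
      35545280 blocks [2/1] [U_]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return {array_name: (active, expected)} for arrays missing a member."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"(md\S*)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Status looks like "[2/1] [U_]": 2 devices expected, 1 in use.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[[U_]+\]", line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected:
                degraded[current] = (active, expected)
    return degraded


print(degraded_arrays(SAMPLE_MDSTAT))
```

On a live system you would pass in the contents of /proc/mdstat instead of the sample, for example from a cron job that logs a timestamp whenever the result is non-empty.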