mdadm 将硬盘标记为故障,即使它处于原始健康状态?

Cob*_*ast 5 linux raid hard-drive timeout mdadm

我的自定义 NAS 配置为在闲置 20 分钟后降低驱动器转速。

刚才我检查/proc/mdstat并注意到一个驱动器被标记为失败,但是 SMART 显示驱动器运行状况非常好。因此我怀疑 md-raid 认为启动时间太长并标记驱动器失败。

重新添加和重建似乎也不是问题。

dmesg 显示了以下有趣的行,我在谷歌上找不到太多。

[97144.228682] sd 0:0:2:0: attempting task abort! scmd(ffff97f7b14ce948)
[97144.228688] sd 0:0:2:0: [sdc] tag#0 CDB: opcode=0x12 12 00 00 00 24 00
[97144.228692] scsi target0:0:2: handle(0x000c), sas_address(0x5001438020b9ee12), phy(18)
[97144.228694] scsi target0:0:2: enclosure_logical_id(0x5001438020b9ee25), slot(49)
[97148.184253] sd 0:0:2:0: task abort: SUCCESS scmd(ffff97f7b14ce948)
[97148.235864] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
--- last message repeated a couple dozen times ---
[97148.490304] sd 0:0:2:0: [sdc] tag#16 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
[97148.490308] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
[97148.490310] sd 0:0:2:0: [sdc] tag#13 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
[97148.490315] sd 0:0:2:0: [sdc] tag#13 CDB: opcode=0x88 88 00 00 00 00 00 0d 6e af f0 00 00 00 10 00 00
[97148.490317] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
[97148.490321] print_req_error: I/O error, dev sdc, sector 225357808
[97148.490326] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
[97148.490331] sd 0:0:2:0: [sdc] tag#16 CDB: opcode=0x88 88 00 00 00 00 00 0d 6e b0 18 00 00 00 20 00 00
[97148.490334] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
[97148.490337] print_req_error: I/O error, dev sdc, sector 225357848
[97148.490341] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
[97148.490354] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
[97148.490358] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
[97148.490366] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
[97148.490370] sd 0:0:2:0: [sdc] tag#15 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
[97148.490374] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
[97148.490378] sd 0:0:2:0: [sdc] tag#15 CDB: opcode=0x88 88 00 00 00 00 00 0d 6e ae 68 00 00 00 08 00 00
[97148.490380] print_req_error: I/O error, dev sdc, sector 225357416
[97148.490383] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
[97148.490392] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
[97148.490399] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
[97148.490403] sd 0:0:2:0: [sdc] tag#14 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
[97148.490407] sd 0:0:2:0: [sdc] tag#14 CDB: opcode=0x88 88 00 00 00 00 00 0d 6e ad 90 00 00 00 30 00 00
[97148.490409] print_req_error: I/O error, dev sdc, sector 225357200
[97148.490435] sd 0:0:2:0: [sdc] tag#11 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
[97148.490439] sd 0:0:2:0: [sdc] tag#11 CDB: opcode=0x88 88 00 00 00 00 00 0d 6e ad c8 00 00 00 58 00 00
[97148.490441] print_req_error: I/O error, dev sdc, sector 225357256
[97148.490450] sd 0:0:2:0: [sdc] tag#10 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
[97148.490454] sd 0:0:2:0: [sdc] tag#10 CDB: opcode=0x88 88 00 00 00 00 00 0d 6e ad 00 00 00 00 50 00 00
[97148.490456] print_req_error: I/O error, dev sdc, sector 225357056
[97148.490464] sd 0:0:2:0: [sdc] tag#9 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
[97148.490468] sd 0:0:2:0: [sdc] tag#9 CDB: opcode=0x35 35 00 00 00 00 00 00 00 00 00
[97148.490472] print_req_error: I/O error, dev sdc, sector 16
[97148.490474] md: super_written gets error=10
[97148.490477] md/raid:md0: Disk failure on sdc, disabling device.
               md/raid:md0: Operation continuing on 3 devices.
[97148.490496] sd 0:0:2:0: [sdc] tag#8 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
[97148.490500] sd 0:0:2:0: [sdc] tag#8 CDB: opcode=0x88 88 00 00 00 00 00 0d 6e b0 40 00 00 00 20 00 00
[97148.490502] print_req_error: I/O error, dev sdc, sector 225357888
[97148.490510] sd 0:0:2:0: [sdc] tag#7 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
[97148.490514] sd 0:0:2:0: [sdc] tag#7 CDB: opcode=0x88 88 00 00 00 00 00 0d 6e af b8 00 00 00 30 00 00
[97148.490516] print_req_error: I/O error, dev sdc, sector 225357752
[97148.490524] sd 0:0:2:0: [sdc] tag#6 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
[97148.490528] sd 0:0:2:0: [sdc] tag#6 CDB: opcode=0x88 88 00 00 00 00 00 0d 6e b0 00 00 00 00 08 00 00
[97148.490530] print_req_error: I/O error, dev sdc, sector 225357824
Run Code Online (Sandbox Code Playgroud)

是否可以增加超时值以使 md-raid 等待几分钟让驱动器联机?
将来有没有其他选择可以防止这种情况发生(除了让我的驱动器 24/7 全天候旋转,因为我也想时不时地睡觉)?


更新 2017-10-07

更新控制器固件(它是 Perc H310 交叉闪烁到 9211-8i IT 模式),更新 SAS 扩展器固件和增加超时似乎大大降低了上述错误频率,但它们仍然发生,并且在某些情况下 md-raid 仍然存在驱动器故障。

我已经解码了 SAS 错误代码:

Value           31110101h
Type:           30000000h       SAS
Origin:         01000000h       PL
Code:           00110000h       PL_LOGINFO_CODE_RESET See Sub-Codes below (PL_LOGINFO_SUB_CODE)
Sub Code:       00000100h       PL_LOGINFO_SUB_CODE_OPEN_FAILURE
SubSub Code:    00000001h       PL_LOGINFO_SUB_CODE_OPEN_FAILURE_NO_DEST_TIMEOUT
Run Code Online (Sandbox Code Playgroud)

除了在线简要说明外,我找不到任何其他内容(在 2009 年的 LSI pdf 中):

无法打开连接,错误 Open Reject (No Destination)。重试了 50 毫秒。

经过一些进一步的测试(通过一个简单的命令引发问题,hdparm -y ...将驱动器旋转下来并hddtemp ...旋转它们)我发现超时略高于 11 秒,这很奇怪,因为唯一的超时设置值为 10 “顺序”、“可移动”和“未知”设备的通用 I/O 超时。


更新 2017-10-08

这是我的设置的拓扑:

Dell Perc H310 (LSISAS2008: FWVersion(20.00.07.00), ChipRevision(0x03), BiosVersion(07.39.02.00)) (flashed to 9211-8i IT-mode)
    `- HP SAS Expander card (FW 2.10)
        |- Hitachi HDS72404 } md0
        |- Hitachi HDS72404 } md0
        |- HGST HDN724040AL } md0
        |- HGST HDN724040AL } md0
        |- ST8000AS0002-1NA (btrfs)
        |- ST8000AS0002-1NA (btrfs)
        `- ST8000AS0002-1NA (xfs)
Run Code Online (Sandbox Code Playgroud)

四个 Hitachi/HGST 硬盘组成了 md-raid 阵列,Seagate 硬盘与 md-raid 无关但也受根问题的影响(但 btrfs 似乎并不关心)。

这是我到目前为止所做的,经过数小时的研究和试验,并没有太大帮助:

在启动时运行以下代码,增加一些mpt2sas超时:

for f in /sys/block/sd?/device/timeout; do
        echo 90 > "$f"
done

for f in /sys/block/sd?/device/eh_timeout; do
        echo 90 > "$f"
done

for f in /sys/class/scsi_disk/*/manage_start_stop; do
        echo 1 > "$f"
done
Run Code Online (Sandbox Code Playgroud)

我已经更新了我的 HBA 和扩展器固件。

我已将 HBA BIOS 配置实用程序中的所有超时设置为 90 秒。

然而,在 11 到 12 秒后从待机状态唤醒(旋转)硬盘驱动器期间,超时仍然会发生。(我怀疑 10 秒超时,因为这是很多超时的默认设置,还有一些额外的延迟。)


更新 2017-10-10

我现在已经编写了一个脚本,它不断地扫描dmesg掉下的 md 设备并自动mdadm --manage /dev/md0 --re-add /dev/sdx为它们发出问题。使用写意图位图恢复现在只需几秒钟而不是一天。但这不可能是这个问题的正确解决方案。

我也刚刚写信给博通,也许他们可以提供帮助。


更新 2017-10-11

我正在调试我的内核以解决可能的问题:

--drive put to standby with hdparm -y--
18:16:35 sd 0:0:1:0: [sdb] sd_open
18:16:35 sd 0:0:1:0: scsi_block_when_processing_errors: rtn: 1
18:16:35 sd 0:0:1:0: scsi_block_when_processing_errors: rtn: 1
18:16:35 sd 0:0:1:0: [sdb] tag#0 Send: scmd 0xffff989bc94ea548
18:16:35 sd 0:0:1:0: [sdb] tag#0 CDB: ATA command pass through(16) 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e0 00
18:16:35 SCSI DEBUG: scsi_check_sense() scsi_check_sense 442 
18:16:35 SCSI DEBUG: scsi_check_sense() continuing default behaviour past line 484 
18:16:35 sd 0:0:1:0: [sdb] tag#0 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK
18:16:35 sd 0:0:1:0: [sdb] tag#0 CDB: ATA command pass through(16) 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e0 00
18:16:35 sd 0:0:1:0: [sdb] tag#0 Sense Key : Recovered Error [current] [descriptor] 
18:16:35 sd 0:0:1:0: [sdb] tag#0 Add. Sense: ATA pass through information available
18:16:35 sd 0:0:1:0: [sdb] tag#0 scsi host busy 1 failed 0
18:16:35 sd 0:0:1:0: Notifying upper driver of completion (result 8000002)
18:16:35 sd 0:0:1:0: [sdb] sd_release
18:16:35 sd 0:0:1:0: [sdb] sd_check_events
18:16:35 sd 0:0:1:0: scsi_block_when_processing_errors: rtn: 1
18:16:35 sd 0:0:1:0: tag#0 Send: scmd 0xffff989bc866e148
18:16:35 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:35 SCSI DEBUG: scsi_check_sense() scsi_check_sense 442 
18:16:35 SCSI DEBUG: scsi_check_sense()=>SUCCESS [nasty midlayer TURs] 
18:16:35 sd 0:0:1:0: tag#0 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK
18:16:35 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:35 sd 0:0:1:0: tag#0 Sense Key : Unit Attention [current] 
18:16:35 sd 0:0:1:0: tag#0 Add. Sense: Power on, reset, or bus device reset occurred
18:16:35 sd 0:0:1:0: tag#0 scsi host busy 1 failed 0
18:16:35 sd 0:0:1:0: Notifying upper driver of completion (result 8000002)
18:16:35 sd 0:0:1:0: tag#0 Send: scmd 0xffff989bc866e148
18:16:35 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:35 SCSI DEBUG: scsi_check_sense() scsi_check_sense 442 
18:16:35 SCSI DEBUG: scsi_check_sense()=>SUCCESS [nasty midlayer TURs] 
18:16:35 sd 0:0:1:0: tag#0 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK
18:16:35 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:35 sd 0:0:1:0: tag#0 Sense Key : Not Ready [current] 
18:16:35 sd 0:0:1:0: tag#0 Add. Sense: Logical unit not ready, initializing command required
18:16:35 sd 0:0:1:0: tag#0 scsi host busy 1 failed 0
18:16:35 sd 0:0:1:0: Notifying upper driver of completion (result 8000002)
--command executed on drive with hddtemp--
18:16:45 sd 0:0:1:0: [sdb] sd_open
18:16:45 sd 0:0:1:0: scsi_block_when_processing_errors: rtn: 1
18:16:45 sd 0:0:1:0: scsi_block_when_processing_errors: rtn: 1
18:16:45 sd 0:0:1:0: scsi_block_when_processing_errors: rtn: 1
18:16:45 sd 0:0:1:0: [sdb] tag#0 Send: scmd 0xffff989bc8669548
18:16:45 sd 0:0:1:0: [sdb] tag#0 CDB: Inquiry 12 00 00 00 24 00
18:16:45 sd 0:0:1:0: [sdb] tag#0 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK
18:16:45 sd 0:0:1:0: [sdb] tag#0 CDB: Inquiry 12 00 00 00 24 00
18:16:45 sd 0:0:1:0: [sdb] tag#0 scsi host busy 1 failed 0
18:16:45 sd 0:0:1:0: Notifying upper driver of completion (result 0)
18:16:45 sd 0:0:1:0: scsi_block_when_processing_errors: rtn: 1
18:16:45 sd 0:0:1:0: [sdb] tag#0 Send: scmd 0xffff989bc8669548
18:16:45 sd 0:0:1:0: [sdb] tag#0 CDB: ATA command pass through(16) 85 08 2e 00 00 00 00 00 00 00 00 00 00 00 ec 00
18:16:45 SCSI DEBUG: scsi_check_sense() scsi_check_sense 442 
18:16:45 SCSI DEBUG: scsi_check_sense() continuing default behaviour past line 484 
18:16:45 sd 0:0:1:0: [sdb] tag#0 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK
18:16:45 sd 0:0:1:0: [sdb] tag#0 CDB: ATA command pass through(16) 85 08 2e 00 00 00 00 00 00 00 00 00 00 00 ec 00
18:16:45 sd 0:0:1:0: [sdb] tag#0 Sense Key : Recovered Error [current] [descriptor] 
18:16:45 sd 0:0:1:0: [sdb] tag#0 Add. Sense: ATA pass through information available
18:16:45 sd 0:0:1:0: [sdb] tag#0 scsi host busy 1 failed 0
18:16:45 sd 0:0:1:0: Notifying upper driver of completion (result 8000002)
18:16:45 sd 0:0:1:0: scsi_block_when_processing_errors: rtn: 1
18:16:45 sd 0:0:1:0: [sdb] tag#0 Send: scmd 0xffff989bc8669548
18:16:45 sd 0:0:1:0: [sdb] tag#0 CDB: ATA command pass through(16) 85 08 2e 00 00 00 00 00 00 00 00 00 00 00 ec 00
18:16:45 SCSI DEBUG: scsi_check_sense() scsi_check_sense 442 
18:16:45 SCSI DEBUG: scsi_check_sense() continuing default behaviour past line 484 
18:16:45 sd 0:0:1:0: [sdb] tag#0 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK
18:16:45 sd 0:0:1:0: [sdb] tag#0 CDB: ATA command pass through(16) 85 08 2e 00 00 00 00 00 00 00 00 00 00 00 ec 00
18:16:45 sd 0:0:1:0: [sdb] tag#0 Sense Key : Recovered Error [current] [descriptor] 
18:16:45 sd 0:0:1:0: [sdb] tag#0 Add. Sense: ATA pass through information available
18:16:45 sd 0:0:1:0: [sdb] tag#0 scsi host busy 1 failed 0
18:16:45 sd 0:0:1:0: Notifying upper driver of completion (result 8000002)
18:16:45 sd 0:0:1:0: scsi_block_when_processing_errors: rtn: 1
18:16:45 sd 0:0:1:0: scsi_block_when_processing_errors: rtn: 1
18:16:45 sd 0:0:1:0: scsi_block_when_processing_errors: rtn: 1
18:16:45 sd 0:0:1:0: [sdb] tag#0 Send: scmd 0xffff989bc8669548
18:16:45 sd 0:0:1:0: [sdb] tag#0 CDB: ATA command pass through(16) 85 06 20 00 d8 00 00 00 00 00 4f 00 c2 00 b0 00
18:16:53 sd 0:0:1:0: [sdb] tag#0 Done: TIMEOUT_ERROR Result: hostbyte=DID_OK driverbyte=DRIVER_OK
18:16:53 sd 0:0:1:0: [sdb] tag#0 CDB: ATA command pass through(16) 85 06 20 00 d8 00 00 00 00 00 4f 00 c2 00 b0 00
18:16:53 sd 0:0:1:0: [sdb] tag#0 scsi host busy 1 failed 0
18:16:53 sd 0:0:1:0: [sdb] tag#0 abort scheduled
18:16:53 sd 0:0:1:0: [sdb] tag#0 aborting command
18:16:53 sd 0:0:1:0: attempting task abort! scmd(ffff989bc8669548)
18:16:53 sd 0:0:1:0: [sdb] tag#0 CDB: ATA command pass through(16) 85 06 20 00 d8 00 00 00 00 00 4f 00 c2 00 b0 00
18:16:53 scsi target0:0:1: handle(0x000a), sas_address(0x5001438020b9ee10), phy(16)
18:16:53 scsi target0:0:1: enclosure_logical_id(0x5001438020b9ee25), slot(51)
18:16:57 sd 0:0:1:0: task abort: SUCCESS scmd(ffff989bc8669548)
18:16:57 sd 0:0:1:0: [sdb] tag#0 finish aborted command
18:16:57 sd 0:0:1:0: Notifying upper driver of completion (result 30000)
18:16:57 sd 0:0:1:0: [sdb] sd_release
18:16:57 sd 0:0:1:0: [sdb] sd_check_events
18:16:57 sd 0:0:1:0: scsi_block_when_processing_errors: rtn: 1
18:16:57 sd 0:0:1:0: tag#0 Send: scmd 0xffff989bd1de9148
18:16:57 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:57 mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
18:16:57 sd 0:0:1:0: tag#0 Done: NEEDS_RETRY Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
18:16:57 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:57 sd 0:0:1:0: tag#0 scsi host busy 1 failed 0
18:16:57 sd 0:0:1:0: tag#0 Inserting command ffff989bd1de9148 into mlqueue
18:16:57 sd 0:0:1:0: unblocking device at zero depth
18:16:57 sd 0:0:1:0: tag#0 Send: scmd 0xffff989bd1de9148
18:16:58 mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
18:16:57 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:58 sd 0:0:1:0: tag#0 Done: NEEDS_RETRY Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
18:16:58 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:58 sd 0:0:1:0: tag#0 scsi host busy 1 failed 0
18:16:58 sd 0:0:1:0: tag#0 Inserting command ffff989bd1de9148 into mlqueue
18:16:58 sd 0:0:1:0: unblocking device at zero depth
18:16:58 sd 0:0:1:0: tag#0 Send: scmd 0xffff989bd1de9148
18:16:58 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:58 mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
18:16:58 sd 0:0:1:0: tag#0 Done: NEEDS_RETRY Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
18:16:58 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:58 sd 0:0:1:0: tag#0 scsi host busy 1 failed 0
18:16:58 sd 0:0:1:0: tag#0 Inserting command ffff989bd1de9148 into mlqueue
18:16:58 sd 0:0:1:0: unblocking device at zero depth
18:16:58 sd 0:0:1:0: tag#0 Send: scmd 0xffff989bd1de9148
18:16:58 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:58 mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
18:16:58 sd 0:0:1:0: tag#0 Done: NEEDS_RETRY Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
18:16:58 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:58 sd 0:0:1:0: tag#0 scsi host busy 1 failed 0
18:16:58 sd 0:0:1:0: tag#0 Inserting command ffff989bd1de9148 into mlqueue
18:16:58 sd 0:0:1:0: unblocking device at zero depth
18:16:58 sd 0:0:1:0: tag#0 Send: scmd 0xffff989bd1de9148
18:16:58 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:58 mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
18:16:58 sd 0:0:1:0: tag#0 Done: NEEDS_RETRY Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
18:16:58 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:58 sd 0:0:1:0: tag#0 scsi host busy 1 failed 0
18:16:58 sd 0:0:1:0: tag#0 Inserting command ffff989bd1de9148 into mlqueue
18:16:58 sd 0:0:1:0: unblocking device at zero depth
18:16:58 sd 0:0:1:0: tag#0 Send: scmd 0xffff989bd1de9148
18:16:58 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:58 mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
18:16:58 sd 0:0:1:0: tag#0 Done: SUCCESS Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
18:16:58 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:58 sd 0:0:1:0: tag#0 scsi host busy 1 failed 0
18:16:58 sd 0:0:1:0: Notifying upper driver of completion (result b0000)
18:16:58 sd 0:0:1:0: device_block, handle(0x000a)
18:16:59 sd 0:0:1:0: device_unblock and setting to running, handle(0x000a)
Run Code Online (Sandbox Code Playgroud)

我觉得特别担心的是

18:16:53 sd 0:0:1:0: [sdb] tag#0 Done: TIMEOUT_ERROR Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Run Code Online (Sandbox Code Playgroud)

立即导致

18:16:53 sd 0:0:1:0: [sdb] tag#0 abort scheduled
18:16:53 sd 0:0:1:0: [sdb] tag#0 aborting command
Run Code Online (Sandbox Code Playgroud)

我想知道在哪里定义超时以及如何更改它。


更新 2017-10-13

通过调试,我在实践中遇到了以下超时:

  • 7s
  • 15s
  • 20s
  • 90 年代(如设置/sys/block/sd?/device/timeout
  • 180s(似乎是之前设置的两倍)

内核源代码中定义了其他超时:

./include/linux/blkdev.h

#define BLK_DEFAULT_SG_TIMEOUT  (60 * HZ)
#define BLK_MIN_SG_TIMEOUT  (7 * HZ)
Run Code Online (Sandbox Code Playgroud)

./include/scsi/scsi.h

#define FORMAT_UNIT_TIMEOUT     (2 * 60 * 60 * HZ)
#define START_STOP_TIMEOUT      (60 * HZ)
#define MOVE_MEDIUM_TIMEOUT     (5 * 60 * HZ)
#define READ_ELEMENT_STATUS_TIMEOUT (5 * 60 * HZ)
#define READ_DEFECT_DATA_TIMEOUT    (60 * HZ )
Run Code Online (Sandbox Code Playgroud)

这些应用于./block/scsi_ioctl.c函数sg_scsi_ioctl(...)blk_fill_sghdr_rq(...).

这解释了 7 秒短暂超时的来源 ( BLK_MIN_SG_TIMEOUT)。

15s 和 20s 超时似乎来自sg_io_hdr*->timeoutinblk_fill_sghdr_rq(...)但我无法找到它之前设置的位置。

Dav*_*ill 2

当然,这只是驱动器确实有故障。

您正在超时/旋转中寻求复杂的答案,而现实是

[97148.490321] print_req_error:I/O 错误,dev sdc,扇区 225357808

控制器无法读取或写入驱动器的特定扇区。高速缓存通常会在旋转继续进行时接受写入。

通常,无论 smartctl 说什么,这种情况只会出现在真正有故障的驱动器上。

更换不同的驱动器会有什么不同吗?