Can I make md (Linux software RAID) more fault tolerant?

Question

Nic*_*ick 6 raid debian hard-drive software-raid

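One particular hard drive in my RAID 1 mirror fails under heavy load, usually while I'm running a full backup.

There's nothing wrong with the drive. OK, there is one error when writing the superblock, but that's all. Every time the backup process runs, I have to manually re-add the same disk to the same array.

Is there any setting that would make md more tolerant of whatever is causing this drive to fail under load?

This is Linux software RAID on Debian.

Update: as requested, the dmesg output at the time of the failure: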

[2347429.116507] print_req_error: I/O error, dev sda, sector 15751347328
[2347429.116511] sd 1:0:0:0: [sda] tag#1058 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[2347429.116516] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[2347429.116518] sd 1:0:0:0: [sda] tag#1058 CDB: Write(16) 8a 08 00 00 00 00 00 00 00 28 00 00 00 08 00 00
[2347429.116522] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[2347429.116523] print_req_error: I/O error, dev sda, sector 40
[2347429.116526] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[2347429.116529] print_req_error: I/O error, dev sda, sector 40
[2347429.116532] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[2347429.116533] md: super_written gets error=10
[2347429.116536] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[2347429.116538] md/raid1:md127: Disk failure on sda, disabling device.
                 md/raid1:md127: Operation continuing on 1 devices.


I also just ran a short SMART offline test, which reported "Completed without error". The status is:

SMART overall-health self-assessment test result: PASSED


Update 2: output of smartctl -a /dev/sda

smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-25-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     HGST Ultrastar He10
Device Model:     HGST HUH721010ALE604
Serial Number:    1EK1W8WZ
LU WWN Device Id: 5 000cca 27eeb2150
Firmware Version: LHGNW384
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Nov 17 13:20:55 2023 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:        (   93) seconds.
Offline data collection
capabilities:            (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    (1167) minutes.
SCT capabilities:          (0x003d) SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   134   134   054    Pre-fail  Offline      -       96
  3 Spin_Up_Time            0x0007   151   151   024    Pre-fail  Always       -       429 (Average 442)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       99
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   128   128   020    Pre-fail  Offline      -       18
  9 Power_On_Hours          0x0012   096   096   000    Old_age   Always       -       28468
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       99
 22 Helium_Level            0x0023   100   100   025    Pre-fail  Always       -       100
192 Power-Off_Retract_Count 0x0032   099   099   000    Old_age   Always       -       1327
193 Load_Cycle_Count        0x0012   099   099   000    Old_age   Always       -       1327
194 Temperature_Celsius     0x0002   181   181   000    Old_age   Always       -       33 (Min/Max 17/45)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       6783827

SMART Error Log Version: 1
ATA Error Count: 65535 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 65535 occurred at disk power-on lifetime: 28414 hours (1183 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 00 a1 db 40 00  14d+02:47:36.906  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00  14d+02:47:36.906  READ LOG EXT
  60 00 08 00 a2 db 40 00  14d+02:47:36.904  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00  14d+02:47:36.890  READ LOG EXT
  2f 00 01 10 00 00 00 00  14d+02:47:36.890  READ LOG EXT

Error 65534 occurred at disk power-on lifetime: 28414 hours (1183 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 00 a1 db 40 00  14d+02:47:36.890  READ FPDMA QUEUED
  60 00 08 00 a2 db 40 00  14d+02:47:36.880  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00  14d+02:47:36.865  READ LOG EXT
  2f 00 01 10 00 00 00 00  14d+02:47:36.865  READ LOG EXT
  60 00 08 00 a2 db 40 00  14d+02:47:36.856  READ FPDMA QUEUED

Error 65533 occurred at disk power-on lifetime: 28414 hours (1183 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 00 a1 db 40 00  14d+02:47:36.865  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00  14d+02:47:36.865  READ LOG EXT
  60 00 08 00 a2 db 40 00  14d+02:47:36.856  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00  14d+02:47:36.840  READ LOG EXT
  2f 00 01 10 00 00 00 00  14d+02:47:36.840  READ LOG EXT

Error 65532 occurred at disk power-on lifetime: 28414 hours (1183 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 00 a1 db 40 00  14d+02:47:36.840  READ FPDMA QUEUED
  60 00 08 00 a2 db 40 00  14d+02:47:36.836  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00  14d+02:47:36.823  READ LOG EXT
  2f 00 01 10 00 00 00 00  14d+02:47:36.823  READ LOG EXT
  60 00 08 00 a2 db 40 00  14d+02:47:36.820  READ FPDMA QUEUED

Error 65531 occurred at disk power-on lifetime: 28414 hours (1183 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 00 a1 db 40 00  14d+02:47:36.823  READ FPDMA QUEUED
  60 00 08 00 a2 db 40 00  14d+02:47:36.820  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00  14d+02:47:36.806  READ LOG EXT
  2f 00 01 10 00 00 00 00  14d+02:47:36.806  READ LOG EXT
  60 00 08 00 a2 db 40 00  14d+02:47:36.796  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     28440         -
# 2  Extended offline    Completed without error       00%        18         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


Answer 1

vid*_*rlo 30

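When you run a backup, you read a lot of data. The drive may be returning read errors and getting dropped because of them. This may only happen in certain areas of the drive that are not normally read.

The problem is that the drive is unreliable. You should replace the drive rather than try to make md accept it. md dropped it for a reason: it can't be trusted.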

Answer 2

sho*_*hok 23

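A single read error will not kick a disk out of the array, at least on post-2012 kernels.

From the md man page:

In later kernels, a read-error will instead cause md to attempt a recovery by overwriting the bad block. i.e. it will find the correct data from elsewhere, write it over the block that failed, and then try to read it back again. If either the write or the re-read fail, md will treat the error the same way that a write error is treated, and will fail the whole device.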
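
The policy in that quote can be written down as a tiny decision function. This is only a toy model of the quoted text, not md's actual code; the function name and the two boolean parameters are made up for illustration.

```python
def handle_read_error(rewrite_ok: bool, reread_ok: bool) -> str:
    """Toy model of the read-error handling described in the md man page quote.

    rewrite_ok: whether writing the good data over the failed block succeeded.
    reread_ok:  whether reading that block back afterwards succeeded.
    Both names are hypothetical; they are not md kernel identifiers.
    """
    # md takes the correct data from elsewhere and overwrites the bad block.
    if not rewrite_ok:
        return "device failed"
    # It then tries to read the block back again.
    if not reread_ok:
        return "device failed"
    return "recovered"


for rewrite_ok, reread_ok in [(True, True), (True, False), (False, True)]:
    print(rewrite_ok, reread_ok, "->", handle_read_error(rewrite_ok, reread_ok))
```

In this model a read error on its own never fails the device; only a failed rewrite or a failed re-read does.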

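For a device to be removed from the array, one of two things should happen:

- a read error followed by a write error on the affected sector
- a link reset issued by the kernel (you can find it in dmesg)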
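
To see which of these happened, one option is to scan the kernel log for the relevant messages. The following is a minimal sketch: the patterns are assumptions modeled on the lines quoted in the question plus the usual libata "resetting link" wording, and other kernels or drivers may phrase these messages differently. The function name `classify_dmesg` is made up for illustration.

```python
import re

# Patterns are assumptions based on the message formats quoted in the question;
# they are not an exhaustive or authoritative list of kernel messages.
PATTERNS = {
    "io_error": re.compile(r"I/O error, dev (\w+), sector (\d+)"),
    "superblock_write_error": re.compile(r"md: super_written gets error=(-?\d+)"),
    "md_disk_failure": re.compile(r"md/raid\d+:(\w+): Disk failure on (\w+)"),
    # libata-style reset message; the quoted log comes from an mpt2sas HBA
    # and contains no such line, so this pattern is included as an assumption.
    "link_reset": re.compile(r"(ata\d+(?:\.\d+)?): (?:hard|soft) resetting link"),
}


def classify_dmesg(text):
    """Return a list of (event_name, captured_groups) in log order."""
    events = []
    for line in text.splitlines():
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                events.append((name, match.groups()))
                break  # one event per log line is enough for this sketch
    return events


sample = """\
[2347429.116523] print_req_error: I/O error, dev sda, sector 40
[2347429.116533] md: super_written gets error=10
[2347429.116538] md/raid1:md127: Disk failure on sda, disabling device.
"""

for name, groups in classify_dmesg(sample):
    print(name, groups)
```

On the excerpt from the question this reports a write I/O error at sector 40, the failed superblock write, and md disabling sda, in that order.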

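If you are sure the disk is fine, try reseating it and/or replacing the SATA/power cables.

If the problem persists, replace it.

EDIT: your dmesg output clearly shows that sda has some serious problems. I would replace it as soon as possible.

*If the problem persists, replace it.* Before replacing any disk, I would also check all system temperatures during the backup. Maybe the system just needs its fans cleaned. Then I would also calculate the power requirements of the system under full load and check them against the actual capacity of the power supply. Is there a voltage drop while the backup is running? (2 upvotes)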

Answer 3

bob*_*lux 13

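I had a similar problem. I first checked the read error counts in the drive's SMART information, but there were none. Yet the operating system reported errors and the drive was kicked out of the RAID.

It turned out to be a faulty SATA cable.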
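
The SMART output in the question fits this pattern: UDMA_CRC_Error_Count has a raw value of 6783827, the reallocated and pending sector counts are 0, and the error log entries are all ICRC. That attribute counts CRC errors on the interface link, which are commonly associated with cabling or connector problems rather than the media. Below is a minimal sketch that pulls those raw values out of `smartctl -a` text. The column layout is assumed to match the table quoted in the question, and `parse_smart_attributes` is a made-up helper name. The final check is a rough heuristic, not a diagnosis.

```python
def parse_smart_attributes(text):
    """Map attribute name -> raw value (int) from `smartctl -a` table rows."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE ...
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            raw = fields[9]  # first token of RAW_VALUE, e.g. "33" in "33 (Min/Max 17/45)"
            if raw.isdigit():
                attrs[fields[1]] = int(raw)
    return attrs


sample = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       6783827
"""

attrs = parse_smart_attributes(sample)
crc_errors = attrs.get("UDMA_CRC_Error_Count", 0)
media_errors = attrs.get("Reallocated_Sector_Ct", 0) + attrs.get("Current_Pending_Sector", 0)

# Heuristic only: many link CRC errors with no media errors suggests looking
# at the cable, connectors, or backplane before blaming the disk surface.
if crc_errors and not media_errors:
    print(f"{crc_errors} interface CRC errors and no media errors: check cabling first")
```

The raw CRC count is cumulative, so after swapping a cable the useful signal is whether it keeps increasing during the next backup.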

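Good answer, because it's always important to remember that these things can be caused by a faulty cable (even a brand-new one), often with no real sign pointing to it. A few years ago I [blogged about how this bit me](https://bakins-bits.dev/dev/2014/05/bad-cables-can-masquerade-as-other-errors/). (3 upvotes)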
