HP P840 HDD RAID 5: lots of strange drive failures

Lao*_*ord 6 raid hp-smart-array raid5 raid-controller drive-failure

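I have been running a RAID5 HDD array (8x6TB) on my HP P840 for about 2 years, and it has always had an unusually high number of drive failures. For half a year everything was fine, but now drives are failing in strange ways. For example, 2 new drives failed a few days after being added to the RAID. I have already replaced the RAID controller, and both the motherboard and the RAID controller are on the latest firmware.

I have also tried different drives. The RAID originally used HGST DeskStar 6TB drives; as drives failed, I have been replacing them with HGST UltraStar 6TB drives. The behavior is the same.

It also seems that (most of) the drives have not really failed: as soon as I replaced the RAID controller, a drive that had been marked as failed was recognized as OK again and started rebuilding.

My hosting provider's support says the problem is that I am using RAID5 and that I should switch to RAID10. I find that hard to believe, because I have been using RAID5 on other systems without problems (no drive failures for years).

Can anyone give me a hint as to what the culprit might be? Is something wrong with the way the RAID controller is configured?

Thanks!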

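Edit:
The server is an HP DL180 G9.
The failure reason for the drives is always "Write retries failed".

Update: Our hosting provider offered to replace the hardware completely and switch to RAID6. We did that, and it has been running smoothly for a while now. Although the cause was never really investigated, I find shodanshok's explanation of a punctured array plausible, so I will accept that answer. Thanks, everyone!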

  Smart Array P840 in Slot 1                (sn: PDNNF0ARH321GD)


     Port Name: 1I

     Port Name: 2I

     Internal Drive Cage at Port 1I, Box 2, OK

     Internal Drive Cage at Port 1I, Box 2, OK

     Internal Drive Cage at Port 2I, Box 1, OK
     array A (Solid State SATA, Unused Space: 0  MB)


  logicaldrive 1 (447.1 GB, RAID 1+0, OK)

  physicaldrive 2I:1:1 (port 2I:box 1:bay 1, Solid State SATA, 240.0 GB, OK)
  physicaldrive 2I:1:2 (port 2I:box 1:bay 2, Solid State SATA, 240.0 GB, OK)
  physicaldrive 2I:1:3 (port 2I:box 1:bay 3, Solid State SATA, 240.0 GB, OK)
  physicaldrive 2I:1:4 (port 2I:box 1:bay 4, Solid State SATA, 240.0 GB, OK)

     array B (SATA, Unused Space: 0  MB)


  logicaldrive 2 (38.2 TB, RAID 5, Interim Recovery Mode)

  physicaldrive 1I:2:1 (port 1I:box 2:bay 1, SATA, 6001.1 GB, OK)
  physicaldrive 1I:2:2 (port 1I:box 2:bay 2, SATA, 6001.1 GB, OK)
  physicaldrive 1I:2:3 (port 1I:box 2:bay 3, SATA, 6001.1 GB, OK)
  physicaldrive 1I:2:4 (port 1I:box 2:bay 4, SATA, 6001.1 GB, OK)
  physicaldrive 1I:2:5 (port 1I:box 2:bay 5, SATA, 6001.1 GB, Failed)
  physicaldrive 1I:2:6 (port 1I:box 2:bay 6, SATA, 6001.1 GB, OK)
  physicaldrive 1I:2:7 (port 1I:box 2:bay 7, SATA, 6001.1 GB, OK)
  physicaldrive 1I:2:8 (port 1I:box 2:bay 8, SATA, 6001.1 GB, OK)

Details:

     Smart Array P840 in Slot 1
        Bus Interface: PCI
        Slot: 1
        Serial Number: PDNNF0ARH321GD
        Cache Serial Number: PEYFP0BRH323YZ
        RAID 6 (ADG) Status: Enabled
        Controller Status: OK
        Hardware Revision: B
        Firmware Version: 6.60
        Rebuild Priority: High
        Expand Priority: Medium
        Surface Scan Delay: 3 secs
        Surface Scan Mode: Idle
        Parallel Surface Scan Supported: Yes
        Current Parallel Surface Scan Count: 1
        Max Parallel Surface Scan Count: 16
        Queue Depth: Automatic
        Monitor and Performance Delay: 60  min
        Elevator Sort: Enabled
        Degraded Performance Optimization: Disabled
        Inconsistency Repair Policy: Disabled
        Wait for Cache Room: Disabled
        Surface Analysis Inconsistency Notification: Disabled
        Post Prompt Timeout: 15 secs
        Cache Board Present: True
     Cache Status: OK
     Cache Ratio: 10% Read / 90% Write
     Drive Write Cache: Enabled
     Total Cache Size: 4.0 GB
     Total Cache Memory Available: 3.2 GB
     No-Battery Write Cache: Enabled
     SSD Caching RAID5 WriteBack Enabled: True
     SSD Caching Version: 2
     Cache Backup Power Source: Batteries
     Battery/Capacitor Count: 1
     Battery/Capacitor Status: OK
     SATA NCQ Supported: True
     Spare Activation Mode: Activate on physical drive failure (default)
     Controller Temperature (C): 51
     Cache Module Temperature (C): 38
     Number of Ports: 2 Internal only
     Encryption: Disabled
     Express Local Encryption: False
     Driver Name: hpsa
     Driver Version: 3.4.16
     Driver Supports HP SSD Smart Path: True
     PCI Address (Domain:Bus:Device.Function): 0000:06:00.0
     Negotiated PCIe Data Rate: PCIe 3.0 x8 (7880 MB/s)
     Controller Mode: RAID
     Controller Mode Reboot: Not Required
     Latency Scheduler Setting: Disabled
     Current Power Mode: MaxPerformance
     Host Serial Number: CZ270500GM
     Sanitize Erase Supported: False
     Primary Boot Volume: logicaldrive 1 (600508B1001CE0F9FACF3A1358647115)
     Secondary Boot Volume: logicaldrive 1 (600508B1001CE0F9FACF3A1358647115)


     Port Name: 1I
           Port ID: 0
           Port Connection Number: 0
           SAS Address: 5001438038AD05A0
           Port Location: Internal
           Managed Cable Connected: False

     Port Name: 2I
           Port ID: 1
           Port Connection Number: 1
           SAS Address: 5001438038AD05A8
           Port Location: Internal
           Managed Cable Connected: False

     Internal Drive Cage at Port 1I, Box 2, OK
        Power Supply Status: Not Redundant
        Drive Bays: 4
        Port: 1I
        Box: 2
        Location: Internal

     Physical Drives
        physicaldrive 1I:2:1 (port 1I:box 2:bay 1, SATA, 6001.1 GB, OK)
        physicaldrive 1I:2:2 (port 1I:box 2:bay 2, SATA, 6001.1 GB, OK)
        physicaldrive 1I:2:3 (port 1I:box 2:bay 3, SATA, 6001.1 GB, OK)
        physicaldrive 1I:2:4 (port 1I:box 2:bay 4, SATA, 6001.1 GB, OK)
        None attached


     Internal Drive Cage at Port 1I, Box 2, OK
        Power Supply Status: Not Redundant
        Drive Bays: 4
        Port: 1I
        Box: 2
        Location: Internal

     Physical Drives
        physicaldrive 1I:2:1 (port 1I:box 2:bay 1, SATA, 6001.1 GB, OK)
        physicaldrive 1I:2:2 (port 1I:box 2:bay 2, SATA, 6001.1 GB, OK)
        physicaldrive 1I:2:3 (port 1I:box 2:bay 3, SATA, 6001.1 GB, OK)
        physicaldrive 1I:2:4 (port 1I:box 2:bay 4, SATA, 6001.1 GB, OK)
        None attached


     Internal Drive Cage at Port 2I, Box 1, OK
        Power Supply Status: Not Redundant
        Drive Bays: 4
        Port: 2I
        Box: 1
        Location: Internal

     Physical Drives
        physicaldrive 2I:1:1 (port 2I:box 1:bay 1, Solid State SATA, 240.0 GB, OK)
        physicaldrive 2I:1:2 (port 2I:box 1:bay 2, Solid State SATA, 240.0 GB, OK)
        physicaldrive 2I:1:3 (port 2I:box 1:bay 3, Solid State SATA, 240.0 GB, OK)
        physicaldrive 2I:1:4 (port 2I:box 1:bay 4, Solid State SATA, 240.0 GB, OK)
        None attached

     Array: A
        Interface Type: Solid State SATA
        Unused Space: 0  MB (0.0%)
        Used Space: 894.2 GB (100.0%)
        Status: OK
        MultiDomain Status: OK
        Array Type: Data
        HP SSD Smart Path: disable



  Logical Drive: 1
     Size: 447.1 GB
     Fault Tolerance: 1+0
     Heads: 255
     Sectors Per Track: 32
     Cylinders: 65535
     Strip Size: 256 KB
     Full Stripe Size: 512 KB
     Status: OK
     MultiDomain Status: OK
     Caching:  Enabled
     Unique Identifier: 600508B1001CE0F9FACF3A1358647115
     Disk Name: /dev/sda
     Mount Points: / 18.6 GB Partition Number 2
     OS Status: LOCKED
     Logical Drive Label: 0216D6F9PDNNF0ARH502MC7DFA
     Mirror Group 1:
        physicaldrive 2I:1:1 (port 2I:box 1:bay 1, Solid State SATA, 240.0 GB, OK)
        physicaldrive 2I:1:2 (port 2I:box 1:bay 2, Solid State SATA, 240.0 GB, OK)
     Mirror Group 2:
        physicaldrive 2I:1:3 (port 2I:box 1:bay 3, Solid State SATA, 240.0 GB, OK)
        physicaldrive 2I:1:4 (port 2I:box 1:bay 4, Solid State SATA, 240.0 GB, OK)
     Drive Type: Data
     LD Acceleration Method: Controller Cache

  physicaldrive 2I:1:1
     Port: 2I
     Box: 1
     Bay: 1
     Status: OK
     Drive Type: Data Drive
     Interface Type: Solid State SATA
     Size: 240.0 GB
     Drive exposed to OS: False
     Native Block Size: 4096
     Firmware Revision: N2010101
     Serial Number: PHDV712004AG240AGN
     Model: ATA     INTEL SSDSC2BB24
     SATA NCQ Capable: True
     SATA NCQ Enabled: True
     Current Temperature (C): 31
     Maximum Temperature (C): 39
     SSD Smart Trip Wearout: Not Supported
     PHY Count: 1
     PHY Transfer Rate: 6.0Gbps
     Drive Authentication Status: Not Authenticated. Smart Array will not control drive LEDs.
     Sanitize Erase Supported: False

  physicaldrive 2I:1:2
     Port: 2I
     Box: 1
     Bay: 2
     Status: OK
     Drive Type: Data Drive
     Interface Type: Solid State SATA
     Size: 240.0 GB
     Drive exposed to OS: False
     Native Block Size: 4096
     Firmware Revision: N2010101
     Serial Number: PHDV706303CH240AGN
     Model: ATA     INTEL SSDSC2BB24
     SATA NCQ Capable: True
     SATA NCQ Enabled: True
     Current Temperature (C): 29
     Maximum Temperature (C): 36
     SSD Smart Trip Wearout: Not Supported
     PHY Count: 1
     PHY Transfer Rate: 6.0Gbps
     Drive Authentication Status: Not Authenticated. Smart Array will not control drive LEDs.
     Sanitize Erase Supported: False

  physicaldrive 2I:1:3
     Port: 2I
     Box: 1
     Bay: 3
     Status: OK
     Drive Type: Data Drive
     Interface Type: Solid State SATA
     Size: 240.0 GB
     Drive exposed to OS: False
     Native Block Size: 4096
     Firmware Revision: N2010101
     Serial Number: PHDV712003V8240AGN
     Model: ATA     INTEL SSDSC2BB24
     SATA NCQ Capable: True
     SATA NCQ Enabled: True
     Current Temperature (C): 29
     Maximum Temperature (C): 35
     SSD Smart Trip Wearout: Not Supported
     PHY Count: 1
     PHY Transfer Rate: 6.0Gbps
     Drive Authentication Status: Not Authenticated. Smart Array will not control drive LEDs.
     Sanitize Erase Supported: False

  physicaldrive 2I:1:4
     Port: 2I
     Box: 1
     Bay: 4
     Status: OK
     Drive Type: Data Drive
     Interface Type: Solid State SATA
     Size: 240.0 GB
     Drive exposed to OS: False
     Native Block Size: 4096
     Firmware Revision: N2010101
     Serial Number: PHDV712004GA240AGN
     Model: ATA     INTEL SSDSC2BB24
     SATA NCQ Capable: True
     SATA NCQ Enabled: True
     Current Temperature (C): 31
     Maximum Temperature (C): 37
     SSD Smart Trip Wearout: Not Supported
     PHY Count: 1
     PHY Transfer Rate: 6.0Gbps
     Drive Authentication Status: Not Authenticated. Smart Array will not control drive LEDs.
     Sanitize Erase Supported: False


     Array: B
        Interface Type: SATA
        Unused Space: 0  MB (0.0%)
        Used Space: 43.7 TB (100.0%)
        Status: Failed Physical Drive
        MultiDomain Status: OK
        Array Type: Data
        HP SSD Smart Path: disable

        Warning: One of the drives on this array have failed or has been removed.




  Logical Drive: 2
     Size: 38.2 TB
     Fault Tolerance: 5
     Heads: 255
     Sectors Per Track: 32
     Cylinders: 65535
     Strip Size: 256 KB
     Full Stripe Size: 1792 KB
     Status: Interim Recovery Mode
     MultiDomain Status: OK
     Caching:  Enabled
     Parity Initialization Status: Initialization Failed
     Unique Identifier: 600508B1001CF94F84873C91FD89B549
     Disk Name: /dev/sdb
     Mount Points: None
     Logical Drive Label: 04DA1DD6PDNNF0ARH502MC546F
     Drive Type: Data
     LD Acceleration Method: Controller Cache

  physicaldrive 1I:2:1
     Port: 1I
     Box: 2
     Bay: 1
     Status: OK
     Drive Type: Data Drive
     Interface Type: SATA
     Size: 6001.1 GB
     Drive exposed to OS: False
     Native Block Size: 4096
     Rotational Speed: 7200
     Firmware Revision: APGNW7JH
     Serial Number: NAHN3UZY
     Model: ATA     HGST HDN726060AL
     SATA NCQ Capable: True
     SATA NCQ Enabled: True
     Current Temperature (C): 37
     Maximum Temperature (C): 43
     PHY Count: 1
     PHY Transfer Rate: 6.0Gbps
     Drive Authentication Status: Not Authenticated. Smart Array will not control drive LEDs.
     Sanitize Erase Supported: False

  physicaldrive 1I:2:2
     Port: 1I
     Box: 2
     Bay: 2
     Status: OK
     Drive Type: Data Drive
     Interface Type: SATA
     Size: 6001.1 GB
     Drive exposed to OS: False
     Native Block Size: 4096
     Rotational Speed: 7200
     Firmware Revision: APGNT517
     Serial Number: NAHLKP0X
     Model: ATA     HGST HDN726060AL
     SATA NCQ Capable: True
     SATA NCQ Enabled: True
     Current Temperature (C): 37
     Maximum Temperature (C): 56
     PHY Count: 1
     PHY Transfer Rate: 6.0Gbps
     Drive Authentication Status: Not Authenticated. Smart Array will not control drive LEDs.
     Sanitize Erase Supported: False

  physicaldrive 1I:2:3
     Port: 1I
     Box: 2
     Bay: 3
     Status: OK
     Drive Type: Data Drive
     Interface Type: SATA
     Size: 6001.1 GB
     Drive exposed to OS: False
     Native Block Size: 4096
     Rotational Speed: 7200
     Firmware Revision: T7MH
     Serial Number: NCH8E81Z
     Model: ATA     HUS726060ALE610
     SATA NCQ Capable: True
     SATA NCQ Enabled: True
     Current Temperature (C): 33
     Maximum Temperature (C): 41
     PHY Count: 1
     PHY Transfer Rate: 6.0Gbps
     Drive Authentication Status: Not Authenticated. Smart Array will not control drive LEDs.
     Sanitize Erase Supported: False

  physicaldrive 1I:2:4
     Port: 1I
     Box: 2
     Bay: 4
     Status: OK
     Drive Type: Data Drive
     Interface Type: SATA
     Size: 6001.1 GB
     Drive exposed to OS: False
     Native Block Size: 4096
     Rotational Speed: 7200
     Firmware Revision: APGNW7JH
     Serial Number: NAHYMAUY
     Model: ATA     HGST HDN726060AL
     SATA NCQ Capable: True
     SATA NCQ Enabled: True
     Current Temperature (C): 34
     Maximum Temperature (C): 41
     PHY Count: 1
     PHY Transfer Rate: 6.0Gbps
     Drive Authentication Status: Not Authenticated. Smart Array will not control drive LEDs.
     Sanitize Erase Supported: False

  physicaldrive 1I:2:5
     Port: 1I
     Box: 2
     Bay: 5
     Status: Failed
     Last Failure Reason: Write retries failed
     Drive Type: Data Drive
     Interface Type: SATA
     Size: 6001.1 GB
     Drive exposed to OS: False
     Native Block Size: 4096
     Rotational Speed: 7200
     Firmware Revision: T7MH
     Serial Number: K1H942MD
     Model: ATA     HUS726060ALE610
     SATA NCQ Capable: True
     SATA NCQ Enabled: True
     Maximum Temperature (C): 43
     PHY Count: 1
     PHY Transfer Rate: 6.0Gbps
     Drive Authentication Status: Not Applicable
     Sanitize Erase Supported: False

  physicaldrive 1I:2:6
     Port: 1I
     Box: 2
     Bay: 6
     Status: OK
     Drive Type: Data Drive
     Interface Type: SATA
     Size: 6001.1 GB
     Drive exposed to OS: False
     Native Block Size: 4096
     Rotational Speed: 7200
     Firmware Revision: TDR2
     Serial Number: K8JM5TKN
     Model: ATA     HUS726060ALE610
     SATA NCQ Capable: True
     SATA NCQ Enabled: True
     Current Temperature (C): 33
     Maximum Temperature (C): 38
     PHY Count: 1
     PHY Transfer Rate: 6.0Gbps
     Drive Authentication Status: Not Authenticated. Smart Array will not control drive LEDs.
     Sanitize Erase Supported: False

  physicaldrive 1I:2:7
     Port: 1I
     Box: 2
     Bay: 7
     Status: OK
     Drive Type: Data Drive
     Interface Type: SATA
     Size: 6001.1 GB
     Drive exposed to OS: False
     Native Block Size: 4096
     Rotational Speed: 7200
     Firmware Revision: APGNW7JH
     Serial Number: K8H9BW2N
     Model: ATA     HGST HDN726060AL
     SATA NCQ Capable: True
     SATA NCQ Enabled: True
     Current Temperature (C): 34
     Maximum Temperature (C): 39
     PHY Count: 1
     PHY Transfer Rate: 6.0Gbps
     Drive Authentication Status: Not Authenticated. Smart Array will not control drive LEDs.
     Sanitize Erase Supported: False

  physicaldrive 1I:2:8
     Port: 1I
     Box: 2
     Bay: 8
     Status: OK
     Drive Type: Data Drive
     Interface Type: SATA
     Size: 6001.1 GB
     Drive exposed to OS: False
     Native Block Size: 4096
     Rotational Speed: 7200
     Firmware Revision: T7MH
     Serial Number: K1H623JD
     Model: ATA     HUS726060ALE610
     SATA NCQ Capable: True
     SATA NCQ Enabled: True
     Current Temperature (C): 35
     Maximum Temperature (C): 40
     PHY Count: 1
     PHY Transfer Rate: 6.0Gbps
     Drive Authentication Status: Not Authenticated. Smart Array will not control drive LEDs.
     Sanitize Erase Supported: False

sho*_*hok 10

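You probably have a badly punctured array, which causes replacement disks to be "scheduled to die" early because of failed stripe rebuilds. You can read more here and here.

The solution is to take a backup, destroy the array, recreate it, and restore from the backup.

Next time, avoid RAID5 arrays with drives this large. I strongly recommend RAID6 or, better, RAID10.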
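A rough, back-of-the-envelope illustration of why large drives and RAID5 are often considered a poor match: while a RAID5 array is degraded, a rebuild has to read every surviving drive in full, and a sector that cannot be read at that point leaves a stripe that cannot be reconstructed. The sketch below assumes independent errors at a fixed unrecoverable-read-error (URE) rate per bit. The two rates used (1e-14 and 1e-15) are typical spec-sheet figures chosen for illustration, not values taken from this thread or from these drives' datasheets, and real-world behavior can differ substantially from this simple model.

```python
import math

def p_unreadable_sector(surviving_drives, drive_bytes, ure_per_bit):
    """Probability of at least one URE while reading all surviving drives once.

    Simplified model: every bit read fails independently with probability
    ure_per_bit (an assumed spec-sheet rate, not a measured one).
    """
    bits_read = surviving_drives * drive_bytes * 8
    # 1 - (1 - p)^n, written with log1p/expm1 to stay accurate for tiny p
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))

# 8-drive RAID5 with one failed drive: 7 drives of 6001.1 GB must be read in full
for rate in (1e-14, 1e-15):  # assumed URE rates, for illustration only
    p = p_unreadable_sector(7, 6001.1e9, rate)
    print(f"assumed URE rate {rate:g}/bit -> P(at least one URE) = {p:.1%}")
```

With RAID6 the second parity block can still reconstruct a stripe when a single read fails during a one-drive rebuild, which is one reason it is usually preferred at this capacity.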


eww*_*ite 5

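With the size and type of disks in your system, you should be using RAID6. However, there is nothing inherently wrong with running RAID5 on an HP Smart Array RAID controller. I think your problem is the result of using consumer disks in a setup that is not certified for the server hardware.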
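To put the RAID level choice in numbers, here is a small sketch of usable capacity and full stripe size for the 8 x 6001.1 GB layout from the question. It assumes the usual data-drive counts (n-1 for RAID5, n-2 for RAID6, n/2 for RAID 1+0) and assumes that the controller's "TB" figure is binary (2^40 bytes) while the per-drive "GB" figure is decimal. That unit assumption is inferred from the posted numbers (38.2 TB logical drive, 1792 KB full stripe), not from documentation.

```python
TIB = 2**40  # assumption: the "TB" in the posted output means 2^40 bytes

def data_drives(level, n_drives):
    """Number of drives' worth of user data for a given RAID level."""
    if level == "5":
        return n_drives - 1   # one drive's worth of parity
    if level == "6":
        return n_drives - 2   # two drives' worth of parity
    if level == "1+0":
        return n_drives // 2  # every block is mirrored once
    raise ValueError(f"unsupported RAID level: {level}")

def usable_tib(level, n_drives, drive_bytes):
    return data_drives(level, n_drives) * drive_bytes / TIB

def full_stripe_kb(level, n_drives, strip_kb):
    return data_drives(level, n_drives) * strip_kb

for level in ("5", "6", "1+0"):
    cap = usable_tib(level, 8, 6001.1e9)
    stripe = full_stripe_kb(level, 8, 256)
    print(f"RAID {level}: {cap:.1f} TiB usable, {stripe} KB full stripe")
```

The RAID 5 row reproduces the 38.2 TB and 1792 KB shown for logical drive 2 in the question, which suggests the unit assumption is reasonable for this output.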

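Some details about the server would help, though.

Is this an HPE server, or are you just using an HPE controller?

These do not appear to be HPE drives or HPE drive trays. That is a bad sign.

The hpssacli output you provided will also show the reason for the disk failure. If you are not on an HPE server and there is a backplane issue or SATA timeouts (I notice you are on SATA disks), you can get false positives.

Example: (see the Last Failure Reason line)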

  physicaldrive 2I:2:8
     Port: 2I
     Box: 2
     Bay: 8
     Status: Failed
     Last Failure Reason: Aborted Command
     Drive Type: Data Drive
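If you want to track failure reasons across many drives or over time, the per-drive blocks in this kind of output are easy to scan with a few lines of Python. This is only a sketch: it assumes the detailed output has already been saved as text in the same "Key: Value" layout shown above, and the function name failed_drives and the SAMPLE string are made up for illustration.

```python
import re

# Shortened sample in the same layout as the detailed output in the question
SAMPLE = """\
  physicaldrive 1I:2:4
     Status: OK
     Drive Type: Data Drive

  physicaldrive 1I:2:5
     Port: 1I
     Box: 2
     Bay: 5
     Status: Failed
     Last Failure Reason: Write retries failed
     Drive Type: Data Drive
"""

def failed_drives(text):
    """Return {drive_id: last_failure_reason} for drives whose Status is Failed."""
    drives = {}
    current = None
    for line in text.splitlines():
        # A bare "physicaldrive <port:box:bay>" line starts a new per-drive block
        header = re.match(r"\s*physicaldrive\s+(\S+)\s*$", line)
        if header:
            current = {"id": header.group(1)}
            drives[current["id"]] = current
            continue
        if current is None or ":" not in line:
            continue
        # Everything else inside a block is treated as "Key: Value"
        key, _, value = line.strip().partition(":")
        current[key.strip()] = value.strip()
    return {
        d["id"]: d.get("Last Failure Reason", "unknown")
        for d in drives.values()
        if d.get("Status") == "Failed"
    }

print(failed_drives(SAMPLE))
```

A reason that repeats across different physical drives in the same bays or on the same port, as with "Write retries failed" in the question, points more toward the path to the drives (backplane, cabling, timeouts) than toward the drives themselves.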