kuz*_*uz8 7 smart megaraid smartctl dell-perc megacli
I am getting a strange SMART error from MegaCli on a Dell R720xd with a PERC H710P and five 4 TB SATA drives in RAID 5.
/opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL
It reports a "Last Predictive Failure Event Seq Number" for one of the drives:
Slot Number: 4
...
Last Predictive Failure Event Seq Number: 7309
...
Inquiry Data: PK2361PAGAZU8WHitachi HUS724040ALE640 MJAOA3B0
...
Drive has flagged a S.M.A.R.T alert : Yes
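To check every slot at once rather than reading the listing by eye, the `-PDList` text can be parsed. This is only a sketch: it assumes each drive's record starts at its `Slot Number` line, as in the excerpts in this post, and the helper names `parse_pdlist` and `flagged_slots` are made up for illustration. In a full listing, any field printed before `Slot Number` would end up attached to the preceding drive.

```python
# Sample text is assembled from the MegaCli excerpts quoted in this post.
SAMPLE = """\
Slot Number: 4
Last Predictive Failure Event Seq Number: 7309
Inquiry Data: PK2361PAGAZU8WHitachi HUS724040ALE640 MJAOA3B0
Drive has flagged a S.M.A.R.T alert : Yes
Slot Number: 0
Last Predictive Failure Event Seq Number: 0
Inquiry Data: PK2331PAG7EENTHitachi HUS724040ALE640 MJAOA3B0
Drive has flagged a S.M.A.R.T alert : No
"""


def parse_pdlist(text):
    """Split -PDList text into one dict of fields per physical drive."""
    drives = []
    current = None
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip "..." and blank lines
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Slot Number":
            # Assumption: a new drive record begins at "Slot Number".
            current = {}
            drives.append(current)
        if current is not None:
            current[key] = value
    return drives


def flagged_slots(drives):
    """Return slot numbers whose S.M.A.R.T alert field reads "Yes"."""
    return [
        int(d["Slot Number"])
        for d in drives
        if d.get("Drive has flagged a S.M.A.R.T alert") == "Yes"
    ]


drives = parse_pdlist(SAMPLE)
print(flagged_slots(drives))
```

In practice the text would come from the `MegaCli64 -PDList -aALL` command shown above instead of the embedded sample.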
But smartctl gives no clue at all as to what is wrong with the drive:
# smartctl -a -d sat+megaraid,4 /dev/sda
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.32-279.19.1.el6.x86_64] (local build)
...
Serial Number: PK2361PAGAZU8W # Note same serial, no mistake
...
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
...
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.
...
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0005 137 137 054 Pre-fail Offline - 79
3 Spin_Up_Time 0x0007 100 100 024 Pre-fail Always - 426
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 7
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 114 114 020 Pre-fail Offline - 37
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 4912
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 7
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 182
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 182
194 Temperature_Celsius 0x0002 176 176 000 Old_age Always - 34 (Min/Max 19/40)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0
SMART Error Log Version: 1
No Errors Logged
There is nothing to frown at in the output above.
I ran a short self-test, which found nothing, and have now started a long test:
Serial Number: PK2331PAG7EENT
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 4911 -
Meanwhile, another disk in the same array has 39 reallocated sectors, and the PERC does not flag it as about to fail. Its smartctl output:
Serial Number: PK2331PAG7EENT
...
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 39
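One likely reason the raw count of 39 trips nothing on the smartctl side: as far as I understand it, an attribute only counts as failing when its normalized VALUE drops to or below THRESH, and the "based on an Attribute check" warning in the earlier output suggests that this comparison is what produced PASSED. A minimal sketch of that comparison, assuming the ten-column table layout shown in this post (`parse_attributes` and `failing_now` are hypothetical helper names, and treating THRESH 000 as "never fails" is my assumption about how smartctl behaves):

```python
# Rows copied from the smartctl tables quoted in this post.
TABLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 39
194 Temperature_Celsius 0x0002 176 176 000 Old_age Always - 34 (Min/Max 19/40)
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
"""


def parse_attributes(table):
    """Parse smartctl attribute rows into dicts; non-data lines are skipped."""
    rows = []
    for line in table.splitlines():
        fields = line.split()
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # header, "..." or blank line
        rows.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "type": fields[6],
            "raw": " ".join(fields[9:]),  # raw value may contain spaces
        })
    return rows


def failing_now(rows):
    """Names of attributes whose normalized value is at or below threshold."""
    # Assumption: a threshold of 0 means the attribute can never fail.
    return [r["name"] for r in rows if r["thresh"] > 0 and r["value"] <= r["thresh"]]


rows = parse_attributes(TABLE)
print(failing_now(rows))
```

With VALUE still at 100 against a THRESH of 005, the 39 reallocated sectors have not moved the normalized value at all, so this check stays silent for that disk.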
And the MegaCli64 output for the same disk with 39 reallocated sectors:
Slot Number: 0
Last Predictive Failure Event Seq Number: 0
Inquiry Data: PK2331PAG7EENTHitachi HUS724040ALE640 MJAOA3B0
...
Drive has flagged a S.M.A.R.T alert : No
The MegaRAID Storage Manager report is not enlightening either:
ID = 113
SEQUENCE NUMBER = 7310
TIME = 11-07-2013 20:58:01
LOCALIZED MESSAGE = Controller ID: 0 Unexpected sense: PD = -:-:4Hardware impending failure general hard drive failure, CDB = 0x03 0x00 0x00 0x00 0x40 0x00 , Sense = 0xf0 0x00 0x00 0x00 0x00 0x00 0x00 0x0a 0x00 0x00 0x00 0x00 0x5d 0x10 0x00 0x00 0x00 0x00
ID = 96
SEQUENCE NUMBER = 7309
TIME = 11-07-2013 20:58:01
LOCALIZED MESSAGE = Controller ID: 0 PD Predictive failure: -:-:4
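The sense bytes in event 7310 can be decoded by hand, which at least shows where the controller's wording comes from. The sketch below assumes the 18 bytes are SCSI fixed-format sense data (response code 0x70 or 0x71), with the sense key in byte 2 and the additional sense code and qualifier in bytes 12 and 13. The function name and the one-entry lookup table are illustrative only, not a complete ASC/ASCQ list.

```python
# Sense bytes copied from the MegaRAID Storage Manager event in this post.
SENSE_HEX = ("0xf0 0x00 0x00 0x00 0x00 0x00 0x00 0x0a 0x00 "
             "0x00 0x00 0x00 0x5d 0x10 0x00 0x00 0x00 0x00")

# Deliberately tiny lookup table: only the code seen in this log.
ASC_ASCQ_TEXT = {
    (0x5D, 0x10): "HARDWARE IMPENDING FAILURE GENERAL HARD DRIVE FAILURE",
}


def decode_fixed_sense(sense):
    """Extract sense key, ASC and ASCQ from fixed-format sense data."""
    response_code = sense[0] & 0x7F  # top bit is the VALID flag
    if response_code not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    return {
        "sense_key": sense[2] & 0x0F,
        "asc": sense[12],
        "ascq": sense[13],
    }


sense = bytes(int(tok, 16) for tok in SENSE_HEX.split())
decoded = decode_fixed_sense(sense)
print(decoded, ASC_ASCQ_TEXT.get((decoded["asc"], decoded["ascq"])))
```

The result is sense key 0 with ASC/ASCQ 5Dh/10h, which matches the "Hardware impending failure general hard drive failure" text in the log. The CDB opcode 0x03 is REQUEST SENSE, so it appears the drive returned this failure-prediction code when the controller polled it, independently of the attribute table smartctl displays.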
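So the disk looks healthy. Any ideas on how to reset the SMART alert? I don't think the SMART statistics are enough to make a warranty claim for it.
PS: We removed the drive from slot #4 and inserted it into slot #5. It shows as healthy and as "Foreign", which is expected, and it is now assigned as a global hot spare. A new drive went into slot #4 and the RAID rebuilt the volume. Dell support suggested using omconfig to get more detailed controller logs.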