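I have a system with a disk that may be failing, but the disk passes every diagnostic. I have not been able to confirm whether the disk is actually bad. What are my options?
I could simply replace the disk, but because this situation closely resembles another, more serious one I am dealing with (long story short), I want to make a correct diagnosis rather than bin hardware at random.
The problem and its history are as follows:
I want to confirm that the disk is bad, but nothing I have done has confirmed it:
smartctl -t long /dev/sda
completes without errors.
dd if=/dev/sda of=/dev/null bs=4096
passes with flying colors. What else can I do to assess the health of the drive?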
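One further check in the same spirit as the dd run is a read scan that records per-block latency as well as hard errors, since a marginal drive may return data slowly rather than fail outright. The following is a minimal sketch, not a definitive tool: the function name `scan_device`, the 1 MiB block size and the 0.5 second threshold are arbitrary assumptions chosen for illustration. Reading a raw device such as /dev/sda requires root, and reads go through the page cache, so a repeated run may be served from memory.

```python
import os
import time


def scan_device(path, block_size=1024 * 1024, slow_threshold=0.5):
    """Read `path` sequentially; return (bytes_read, errors, slow_blocks).

    errors      -- list of (offset, errno) for reads that raised OSError
    slow_blocks -- list of (offset, seconds) for reads slower than slow_threshold
    """
    errors, slow_blocks = [], []
    bytes_read = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)  # works for regular files and block devices
        offset = 0
        while offset < size:
            want = min(block_size, size - offset)
            start = time.monotonic()
            try:
                data = os.pread(fd, want, offset)
            except OSError as exc:
                # Record the failing offset and continue with the next block.
                errors.append((offset, exc.errno))
                offset += want
                continue
            elapsed = time.monotonic() - start
            if not data:
                break  # unexpected end of device
            if elapsed > slow_threshold:
                slow_blocks.append((offset, elapsed))
            bytes_read += len(data)
            offset += len(data)
    finally:
        os.close(fd)
    return bytes_read, errors, slow_blocks
```

A call such as `scan_device("/dev/sda")` returns the offsets of any failed or slow reads, which can then be compared with the sectors named in the kernel log.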
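Again, this is not about getting this router fully working again. It is a disk forensics question: as it happens, I have another server that may have the same problem, and knowing the answer here could help me a lot.
For the record, here are the logs and so on.
This is the smartctl -a
output: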
smartctl 5.40 2010-07-12 r3124 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.7 and 7200.7 Plus family
Device Model: ST3120026A
Serial Number: 5JT1CLQM
Firmware Version: 3.06
User Capacity: 120,034,123,776 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 6
ATA Standard is: ATA/ATAPI-6 T13 1410D revision 2
Local Time is: Mon Jul 1 21:18:33 2013 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 24) The self-test routine was aborted by
the host.
Total time to complete Offline
data collection: ( 430) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
No General Purpose Logging support.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 85) minutes.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 050 046 006 Pre-fail Always - 47766662
3 Spin_Up_Time 0x0003 097 096 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 10
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 31
7 Seek_Error_Rate 0x000f 084 060 030 Pre-fail Always - 820305
9 Power_On_Hours 0x0032 048 048 000 Old_age Always - 46373
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 605
194 Temperature_Celsius 0x0022 036 065 000 Old_age Always - 36
195 Hardware_ECC_Recovered 0x001a 050 046 000 Old_age Always - 47766662
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 196 000 Old_age Always - 6
200 Multi_Zone_Error_Rate 0x0000 100 253 000 Old_age Offline - 0
202 Data_Address_Mark_Errs 0x0032 100 253 000 Old_age Always - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Aborted by host 80% 46361 -
# 2 Extended offline Completed without error 00% 46358 -
# 3 Short offline Completed without error 00% 12046 -
# 4 Extended offline Completed without error 00% 10472 -
# 5 Short offline Completed without error 00% 10471 -
# 6 Short offline Completed without error 00% 10471 -
# 7 Short offline Completed without error 00% 6770 -
# 8 Extended offline Aborted by host 90% 5958 -
# 9 Extended offline Aborted by host 90% 5951 -
#10 Short offline Completed without error 00% 5024 -
#11 Extended offline Aborted by host 80% 5024 -
#12 Short offline Completed without error 00% 3697 -
#13 Short offline Completed without error 00% 237 -
#14 Short offline Completed without error 00% 145 -
#15 Short offline Completed without error 00% 69 -
#16 Extended offline Completed without error 00% 68 -
#17 Short offline Completed without error 00% 66 -
#18 Short offline Completed without error 00% 49 -
#19 Short offline Completed without error 00% 29 -
#20 Short offline Completed without error 00% 29 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
This is the dmesg error at the time of the crash (repeated for a number of different sectors):
[1755091.211136] sd 0:0:0:0: [sda] Unhandled error code
[1755091.211144] sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[1755091.211151] sd 0:0:0:0: [sda] CDB: Read(10): 28 00 08 fe ad 38 00 00 08 00
[1755091.211166] end_request: I/O error, dev sda, sector 150908216
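As a cross-check on the kernel log, the CDB bytes can be decoded by hand. A READ(10) command carries a 4-byte big-endian logical block address in bytes 2 to 5 and a 2-byte transfer length in bytes 7 to 8. The sketch below decodes the CDB printed above; the helper name `decode_read10` is made up for illustration.

```python
def decode_read10(cdb_hex):
    """Decode a SCSI READ(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    length = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return lba, length


lba, length = decode_read10("28 00 08 fe ad 38 00 00 08 00")
print(lba, length)  # 150908216 8
```

The decoded address matches the sector reported by end_request, and a length of 8 blocks is a single 4096-byte read if the sectors are 512 bytes. Note that hostbyte=DID_BAD_TARGET is a status set on the host side of the Linux SCSI stack, not a media error reported by the drive. One possible reading, which the log alone does not prove, is that the device was not responding at that moment.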
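You cannot do so reliably.
Or rather, you have already done it with the options available to you.
A Google study found that failing disks do not necessarily show abnormal SMART values (the reverse is more reliable: disks that do show anomalies tend to go on to fail).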
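To make that comparison concrete, the attribute table from smartctl -a can be checked mechanically. The sketch below is an illustration under assumptions: the function name `flag_attributes` and the set of raw counters to watch (reallocated, pending and uncorrectable sectors, plus UDMA CRC errors) are choices made for this example, not part of smartmontools. It flags an attribute when its normalized VALUE has dropped to or below THRESH, or when one of the watched raw counters is non-zero.

```python
WATCH_RAW = {5, 197, 198, 199}  # assumption: raw counters worth a look when non-zero


def flag_attributes(smart_output):
    """Return [(id, name, reason)] for suspicious rows of a smartctl attribute table."""
    flagged = []
    for line in smart_output.splitlines():
        fields = line.split()
        # Attribute rows: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if not (fields[3].isdigit() and fields[5].isdigit()):
            continue
        attr_id, name = int(fields[0]), fields[1]
        value, thresh = int(fields[3]), int(fields[5])
        raw = fields[9]
        if thresh > 0 and value <= thresh:
            flagged.append((attr_id, name, "value %d <= threshold %d" % (value, thresh)))
        elif attr_id in WATCH_RAW and raw.isdigit() and int(raw) > 0:
            flagged.append((attr_id, name, "raw count %s" % raw))
    return flagged


sample = """\
  5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 31
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x003e 200 196 000 Old_age Always - 6
"""
for row in flag_attributes(sample):
    print(row)
```

Applied to the table in the question, this flags Reallocated_Sector_Ct (raw 31) and UDMA_CRC_Error_Count (raw 6), while every normalized value is still above its threshold. UDMA CRC errors are counted on the interface between drive and host, so they are commonly associated with cabling or controller problems rather than with the platters.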
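Putting that aside for a moment, keep in mind that although much of computing is standardized, real hardware and software contain bugs, tolerances can add up, and so on. The real world is not perfect, and it is not unheard of for a hard disk not to play well with a particular controller, or vice versa. Sometimes the cause is a firmware fault, sometimes an entirely different system component is misbehaving, such as a substandard PSU that sags under particular load peaks. Even temperature changes, age... the list can be extended almost at will.
The standard procedure here is therefore to put the disk into a significantly different system configuration and rerun the tests. Since you have already done that by changing the system completely, you have already correctly concluded that the disk must be faulty. (Unless you did not change everything else as you told us; the cable and HBA come to mind, in which case that assumption would not hold.)
EDIT: I just realized there is one option left. You could check whether a firmware version newer than the one currently on your particular drive is available for this disk model. If so, look through the changelog for hints about what the problem in your case might be.
In conclusion, to be completely sure (in this particular case!) that the drive is misbehaving, you would need to send it back to the manufacturer.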