SMART - Understanding offline data collection

Ban*_*aio 6 ssd smart synology

I have two Kingston A400 120GB SSDs acting as cache in a Synology NAS, and they do not seem to support automatic offline data collection.

# smartctl -d sat -c /dev/sdc | grep -i "Auto Offline data collection" 
Auto Offline Data Collection: Disabled.  
No Auto Offline data collection support.
# smartctl -d sat -o on /dev/sdc
SMART Automatic Timers not supported
SMART Enable Automatic Offline failed: scsi error aborted command

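However, when I check the attributes flagged as "Offline", the RAW_VALUE of one of them (specifically 246 Total_Erase_Count) keeps changing, even though I never run a manual offline data collection or a self-test. I checked whether smartd was running, just in case, and it is not. The same thing happens on the other, identical SSD.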
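For reference, a minimal sketch of how the change can be observed. It assumes the ten-column layout of `smartctl -A` shown in the output below; `offline_attrs` and the `/tmp` file names are hypothetical names introduced only for this example.

```shell
#!/bin/sh
# Filter `smartctl -A` output (read from stdin) down to the attributes whose
# UPDATED column (field 8) is "Offline", printing: ID NAME RAW_VALUE.
offline_attrs() {
    awk '$1 ~ /^[0-9]+$/ && $8 == "Offline" { print $1, $2, $10 }'
}

# Demo on two rows copied from the output in this question.
offline_attrs <<'EOF'
 12 Power_Cycle_Count   0x0032   100   100   000    Old_age   Always       -       5
246 Total_Erase_Count   0x0000   100   100   000    Old_age   Offline      -       3827
EOF
# prints: 246 Total_Erase_Count 3827

# On the NAS itself (file paths are hypothetical), two snapshots taken some
# time apart can be compared with diff:
#   smartctl -d sat -A /dev/sdc | offline_attrs > /tmp/smart.before
#   sleep 3600
#   smartctl -d sat -A /dev/sdc | offline_attrs > /tmp/smart.after
#   diff /tmp/smart.before /tmp/smart.after
```

Any line that `diff` reports is an "Offline" attribute whose raw value moved between the two snapshots without an offline collection or self-test having been requested.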

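Questions:

  1. What exactly does offline data collection update? Does it only update the VALUE/WORST/THRESH columns of the attribute table?
  2. Does a short or long self-test update the SMART attribute data?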

Output of smartctl -a

=== START OF INFORMATION SECTION ===
Model Family:     Phison Driven SSDs
Device Model:     KINGSTON SA400S37120G
Serial Number:    [...]
LU WWN Device Id: [...]
Firmware Version: 03070009
User Capacity:    120,034,123,776 bytes [120 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 1.5 Gb/s)
Local Time is:    Fri Apr 12 01:55:30 2019 -03
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x35) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Abort Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0002) Does not save SMART data before
                                        entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x00) Error logging NOT supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (   1) minutes.
Conveyance self-test routine
recommended polling time:        (   1) minutes.

SMART Attributes Data Structure revision number: 5
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME                                                   FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate                                              0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours                                                   0x0032   100   100   000    Old_age   Always       -       710
 12 Power_Cycle_Count                                                0x0032   100   100   000    Old_age   Always       -       5
148 Unknown_Attribute                                                0x0000   100   100   000    Old_age   Offline      -       0
149 Unknown_Attribute                                                0x0000   100   100   000    Old_age   Offline      -       0
167 Unknown_Attribute                                                0x0000   100   100   000    Old_age   Offline      -       0
168 SATA_Phy_Error_Count                                             0x0012   100   100   000    Old_age   Always       -       0
169 Unknown_Attribute                                                0x0000   100   100   000    Old_age   Offline      -       65
170 Bad_Blk_Ct_Erl/Lat                                               0x0000   100   100   010    Old_age   Offline      -       0/78
172 Unknown_Attribute                                                0x0032   100   100   000    Old_age   Always       -       0
173 MaxAvgErase_Ct                                                   0x0000   100   100   000    Old_age   Offline      -       0
181 Program_Fail_Cnt_Total                                           0x0032   100   100   000    Old_age   Always       -       0
182 Erase_Fail_Count_Total                                           0x0000   100   100   000    Old_age   Offline      -       0
187 Reported_Uncorrect                                               0x0032   100   100   000    Old_age   Always       -       0
192 Unsafe_Shutdown_Count                                            0x0012   100   100   000    Old_age   Always       -       1
194 Temperature_Celsius                                              0x0022   024   025   000    Old_age   Always       -       24 (Min/Max 24/25)
196 Not_In_Use                                                       0x0032   100   100   000    Old_age   Always       -       0
199 CRC_Error_Count                                                  0x0032   100   100   000    Old_age   Always       -       0
218 CRC_Error_Count                                                  0x0032   100   100   000    Old_age   Always       -       4
231 SSD_Life_Left                                                    0x0000   100   100   000    Old_age   Offline      -       0
233 Flash_Writes_GiB                                                 0x0032   100   100   000    Old_age   Always       -       396
241 Lifetime_Writes_GiB                                              0x0032   100   100   000    Old_age   Always       -       304
242 Lifetime_Reads_GiB                                               0x0032   100   100   000    Old_age   Always       -       228
244 Average_Erase_Count                                              0x0000   100   100   000    Old_age   Offline      -       2
245 Max_Erase_Count                                                  0x0000   100   100   000    Old_age   Offline      -       10
246 Total_Erase_Count                                                0x0000   100   100   000    Old_age   Offline      -       3827

SMART Error Log not supported

SMART Self-test Log not supported

Selective Self-tests/Logging not supported

sho*_*hok 3

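Short answer: SSDs encapsulate their internal data collection and reporting behind a complex controller and FTL firmware, so what you see at the SMART level is rarely a complete representation of their internal state. Do not worry about offline testing being apparently disabled, because the controller will most likely run its own sanity tests and update both online and offline attributes accordingly. The exception is firmware that intentionally cripples SMART attributes, but that happens with HDDs too, and there is nothing you can do about it.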

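Long answer: SMART offline data collection is a poorly defined method of collecting disk data which can, in principle, degrade IO performance, because the specific tests/collection cannot really run in parallel with user data IO. Hence the term "offline": the disk firmware is free to suspend user IO while offline attributes are being collected. For this reason, offline collection can be completely disabled, explicitly requested by the user at a scheduled time or (if the disk supports it) run automatically from a programmed timer.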

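However, offline testing was never formally included in the ATA standard (although it is present in other storage-related standards), which opened the door to firmware-specific and often undocumented behavior.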

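For every disk I have used in the last 15+ years, offline tests were really "online" tests, with no performance degradation during data collection. The only difference from online tests is that offline data is collected on a specific, firmware-dependent schedule (e.g. every 4 hours).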
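To see where the "Always"/"Offline" label comes from, here is a rough sketch. It assumes the interpretation used by smartmontools, where bit 0 of the attribute FLAG word marks a pre-failure attribute and bit 1 marks one that is updated during normal operation. The function names are hypothetical and the flag values are taken from the table in the question.

```shell
#!/bin/sh
# Assumption: bit 1 (0x02) of FLAG set => updated online ("Always"),
# clear => updated only during offline collection ("Offline").
flag_update_mode() {
    if [ $(( $1 & 0x02 )) -ne 0 ]; then
        echo "Always"
    else
        echo "Offline"
    fi
}

# Assumption: bit 0 (0x01) of FLAG set => "Pre-fail", clear => "Old_age".
flag_type() {
    if [ $(( $1 & 0x01 )) -ne 0 ]; then
        echo "Pre-fail"
    else
        echo "Old_age"
    fi
}

# FLAG values that appear in the attribute table of the question.
for f in 0x0032 0x0012 0x0000; do
    echo "$f $(flag_type "$f") $(flag_update_mode "$f")"
done
```

So the label only repeats what the firmware declares in the flag word. It does not prove that the drive waits for an offline collection before refreshing the value.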

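The only exception I have found is Offline surface scan, a specific offline subtest that scans the entire platter surface (or the NAND chips, for SSDs) for defects. Being such an intensive test, it is reported separately and can sometimes be selectively enabled or disabled. However, most HDDs (and SSDs) report surface scan as unsupported and implement firmware- and model-specific scanning instead. For example, most consumer HDDs do not scan their surface at all, while enterprise disks scan it automatically even when SMART reports surface scan as disabled. SSDs are much more complex: the controller has to scan the flash state periodically to rewrite marginal pages, so a surface scan is basically meaningless for them.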
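As an illustration of how these capabilities are reported, here is a sketch that decodes the "Offline data collection capabilities" byte (0x35 on the drive in the question). The bit assignments are an assumption based on how smartmontools interprets that byte, and `report` and `decode_offline_caps` are hypothetical helper names.

```shell
#!/bin/sh
# report LABEL VALUE MASK: print "LABEL: yes" if VALUE has the MASK bit set.
report() {
    if [ $(( $2 & $3 )) -ne 0 ]; then
        echo "$1: yes"
    else
        echo "$1: no"
    fi
}

# Decode the capabilities byte (accepts decimal or 0x-prefixed hex).
# Bit layout is an assumption taken from smartmontools' interpretation.
decode_offline_caps() {
    caps=$(( $1 ))
    report "execute offline immediate"    "$caps" 0x01
    report "auto offline timer"           "$caps" 0x02
    # When this bit is clear, smartctl reports "Suspend" instead of "Abort".
    report "abort offline on new command" "$caps" 0x04
    report "offline surface scan"         "$caps" 0x08
    report "self-test"                    "$caps" 0x10
    report "conveyance self-test"         "$caps" 0x20
    report "selective self-test"          "$caps" 0x40
}

decode_offline_caps 0x35
```

For 0x35 this gives "no" for the auto offline timer, the offline surface scan and the selective self-test, which matches the text smartctl printed for this drive.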