How is the Percent_Lifetime_Remaining SMART attribute calculated on Crucial MX500 disks?

Rya*_*n J 5 ssd smart zfs

I have a homelab-class server in which I installed 4 Crucial MX500 disks a few months ago. One of the disks (they are all similar) reports the following SMART details:

smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-957.12.2.el7.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Crucial/Micron BX/MX1/2/3/500, M5/600, 1100 SSDs
Device Model:     CT500MX500SSD1
Serial Number:    XXXXXXXXXXX
LU WWN Device Id: 5 00a075 1e1e22806
Firmware Version: M3CR023
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA >3.2 (0x1ff), 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Aug  9 17:29:43 2019 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  30) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x0031) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       554
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       35
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   092   092   000    Old_age   Always       -       127
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       9
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       43
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   064   046   000    Old_age   Always       -       36 (Min/Max 0/54)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   092   092   001    Old_age   Offline      -       8
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   000    Old_age   Always       -       7227541253
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       128825080
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       1974407892

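I'm trying to figure out how attribute 202 is calculated, because it seems to be dropping quickly. The machine runs ZFS and has an uptime of 41 days; zpool iostat -v shows: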

                                capacity     operations     bandwidth
pool                                 alloc   free   read  write   read  write
-----------------------------------  -----  -----  -----  -----  -----  -----
neo                                  1.10T   727G    123    174  4.84M  2.27M
  raidz1                             1.10T   727G    123    174  4.84M  2.27M
    ata-CT500MX500SSD1_1XXXXXXXXXXX      -      -     31     44  1.23M   597K
    ata-CT500MX500SSD1_1XXXXXXXXXXX      -      -     30     42  1.19M   567K
    ata-CT500MX500SSD1_1XXXXXXXXXXX      -      -     31     44  1.23M   597K
    ata-CT500MX500SSD1_1XXXXXXXXXXX      -      -     30     42  1.19M   567K
-----------------------------------  -----  -----  -----  -----  -----  -----

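As I understand it, each disk is writing at less than 1 MB/s. Taking 1 MB/s as a rough upper bound, that is about 86 GB per day, or roughly 2.5 TB per month. The rated 180 TBW divided by 2.5 TB per month comes to about 72 months, or 6 years. Yet I have already used up 8% in about 2.5 months.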
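For reference, the estimate above can be worked through as follows. This is only a back-of-envelope sketch: the 1 MB/s figure is a rounded-up bound read off zpool iostat, the 30-day month and decimal units are assumptions, and the variable names are purely illustrative.

```python
# Back-of-envelope endurance estimate using the figures from the question.
SECONDS_PER_DAY = 24 * 60 * 60

write_rate_mb_s = 1.0   # rounded-up bound; zpool iostat shows roughly 0.6 MB/s per disk
rated_tbw = 180.0       # rated endurance quoted in the question, in TB written
days_per_month = 30     # assumption: a 30-day month

gb_per_day = write_rate_mb_s * SECONDS_PER_DAY / 1000    # MB -> GB (decimal)
tb_per_month = gb_per_day * days_per_month / 1000        # GB -> TB (decimal)
months_to_rated_tbw = rated_tbw / tb_per_month

print(f"{gb_per_day:.1f} GB/day, {tb_per_month:.2f} TB/month")
print(f"{months_to_rated_tbw:.1f} months to reach {rated_tbw:.0f} TBW")
```

Without rounding the monthly figure down to 2.5 TB, the result lands a little under 70 months rather than 72, which does not change the conclusion.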

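I would like to know how attribute 202 is calculated so that I can compute it by hand and start working out whether there is some kind of write amplification problem. I'm a little hesitant to trust the SMART statistics, since they show 23 days of PoH even though the system has 41 days of uptime, and this particular disk model also has a notorious problem with CurrentPendingSector.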
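As a sanity check on the raw counters in the smartctl output above, the host-side writes and power-on time can be converted into more familiar units. This is a sketch that assumes attribute 246 counts 512-byte logical sectors (the logical sector size smartctl reports for this drive); if the drive counts in some other unit, the result scales accordingly.

```python
# Convert raw SMART counters from the smartctl output into familiar units.
LOGICAL_SECTOR_BYTES = 512                 # assumption: attribute 246 counts logical sectors

total_host_sector_writes = 7_227_541_253   # attribute 246 raw value
power_on_hours = 554                       # attribute 9 raw value
rated_tbw = 180.0                          # rated endurance quoted in the question

host_tb_written = total_host_sector_writes * LOGICAL_SECTOR_BYTES / 1e12
host_share_of_tbw = 100 * host_tb_written / rated_tbw
power_on_days = power_on_hours / 24

print(f"host writes: {host_tb_written:.2f} TB ({host_share_of_tbw:.1f}% of rated TBW)")
print(f"power-on time: {power_on_days:.1f} days")
```

Under that assumption the host has written only a small fraction of the rated TBW, noticeably less than the 8% that attribute 202 reports, which is what makes the question worth asking.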

小智 6

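Traditionally, "percent lifetime remaining" is the average erase count compared against the "rated endurance" of the actual flash. With an average block erase count of 127 (attribute 173) at 8% used, this implies the MX500's flash is rated for about 1500 cycles, which is believable for 3D TLC flash.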
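That relationship can be sketched in a few lines. The 1500-cycle rating is inferred from the numbers in the question rather than taken from a datasheet, and truncating to a whole percent is an assumption about how the firmware rounds; the function name is hypothetical.

```python
def percent_lifetime_used(avg_erase_count: int, rated_pe_cycles: int) -> int:
    """Wear as a whole percentage: average erase count over rated P/E cycles.

    Truncation toward zero is an assumption, not documented firmware behavior.
    """
    return int(100 * avg_erase_count / rated_pe_cycles)


avg_block_erase_count = 127   # attribute 173 raw value
rated_pe_cycles = 1500        # inferred rating, not an official figure

used = percent_lifetime_used(avg_block_erase_count, rated_pe_cycles)
remaining = 100 - used

print(f"used: {used}%, remaining: {remaining}%")
```

With these inputs the sketch reproduces the raw value of 8 and the normalized value of 92 shown for attribute 202.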

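ZFS /z1 layouts on consumer SSDs push a lot of drive sync commands. This really increases the write amplification of the individual drives. I have tested larger arrays (by drive count) and seen > 20x amplification at the ZFS level, before the drive's internal logic even comes into play (this was a zvol with a 100% random 4K write workload, so a worst case).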
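The drive's own page counters give a rough view of amplification inside the drive itself, separate from anything ZFS adds above it. The sketch below assumes the commonly used convention for Micron/Crucial drives that attribute 247 counts NAND pages programmed on behalf of the host and attribute 248 counts pages programmed by the FTL for its own housekeeping; treat the result as indicative only.

```python
# Rough drive-internal write amplification from the SMART page counters.
# Assumption: WAF = (host pages + FTL pages) / host pages.
host_program_pages = 128_825_080     # attribute 247 raw value
ftl_program_pages = 1_974_407_892    # attribute 248 raw value

write_amplification = (host_program_pages + ftl_program_pages) / host_program_pages

print(f"drive-internal write amplification: {write_amplification:.1f}x")
```

A value well above 1 here would be consistent with many small synchronous writes forcing the FTL to do extra work.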

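Your array is only 2.3x worse than your own calculation (8% after 2.5 months extrapolates to 31 months, versus the 72 months you hoped for). This is easily explained by ZFS write behavior. In other words, you are getting what one would expect from consumer drives with ZFS in /z1.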
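The 2.3x figure comes from a simple linear extrapolation, sketched here with the numbers from the question. It assumes wear keeps accumulating at the same rate, which is a simplification.

```python
# Linear extrapolation of drive life from observed wear.
percent_used = 8.0        # attribute 202 raw value
elapsed_months = 2.5      # time in service, per the question
expected_months = 72.0    # lifetime the question estimated from rated TBW

projected_months = elapsed_months / (percent_used / 100)
shortfall_factor = expected_months / projected_months

print(f"projected life: {projected_months:.1f} months")
print(f"shortfall versus estimate: {shortfall_factor:.1f}x")
```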

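Non-consumer drives, even when built with the same flash, have power-loss protection hardware that lets a sync complete without actually syncing the FTL layout every time. This greatly reduces wear and makes ZFS much more "survivable", though it is still slow compared with other file systems.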