Slow sequential speed on 9x7-drive raidz2 (ZFS ZoL 0.8.1)

Question

obr*_*nmd 10 performance zfs storage iscsi zfsonlinux

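I'm running a large ZFS pool on Ubuntu 18.04, built for sequential reads and writes at 256K+ request sizes over iSCSI (for backups). Given the need for high throughput and space efficiency, and the lesser need for random small-block performance, I went with striped raidz2 rather than striped mirrors.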

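However, 256K sequential read performance is far lower than I expected (100 - 200MBps, with peaks up to 600MBps). When the zvols are hitting ~99% iowait in iostat, the backing devices typically run between 10% and 40% iowait. That suggests to me the bottleneck is something I'm missing in the configuration: it shouldn't be the backplane or the CPUs in this system, and a sequential workload shouldn't work the ARC too hard.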

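I've played with the module parameters quite a bit (current config below) and read hundreds of articles, issues on the OpenZFS GitHub, and so on. Tuning prefetch and aggregation got me to this performance level. With the defaults, sequential reads ran at about 50MBps because ZFS was sending TINY requests (~16K) to the disks. With aggregation and prefetch working OK (I think), disk reads are much larger, averaging around 64K in iostat.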

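The NICs (LIO iSCSI target with cxgbit offload, plus the Windows Chelsio iSCSI initiator) work well outside of the ZFS zvols: a directly mapped Optane returns nearly full line rate on the NICs (~3.5GBps read and write).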

Am I expecting too much? I know ZFS prioritizes safety over performance, but I'd expect a 7x9 raidz2 to deliver better sequential reads than a single 9-drive mdadm raid6.

System specs and logs/config files:

Chassis: Supermicro 6047R-E1R72L
HBAs: 3x 2308 IT mode (24x 6Gbps SAS channels to backplanes)
CPU: 2x E5-2667v2 (8 cores @ 3.3GHz base each)
RAM: 128GB, 104GB dedicated to ARC
HDDs: 65x HGST 10TB HC510 SAS (9x 7-wide raidz2 + 2 spares)
SSDs: 2x Intel Optane 900P (partitioned for mirrored special and log vdevs)
NIC: Chelsio 40Gbps (same as on initiator, both using hw offloaded iSCSI)
OS: Ubuntu 18.04 LTS (using latest non-HWE kernel that allows ZFS SIMD)
ZFS: 0.8.1 via PPA
Initiator: Chelsio iSCSI initiator on Windows Server 2019

Pool configuration:

ashift=12
recordsize=128K (blocks on zvols are 64K, below)
compression=lz4
xattr=sa
redundant_metadata=most
atime=off
primarycache=all

ZVol configuration:

sparse
volblocksize=64K (matches OS allocation unit on top of iSCSI)

Pool layout:

7x 9-wide raidz2
mirrored 200GB optane special vdev (SPA metadata allocation classes)
mirrored 50GB optane log vdev

/etc/modprobe.d/zfs.conf:

# 52 - 104GB ARC, this system does nothing else
options zfs zfs_arc_min=55834574848
options zfs zfs_arc_max=111669149696

# allow for more dirty async data
options zfs zfs_dirty_data_max_percent=25
options zfs zfs_dirty_data_max=34359738368

# txg timeout given we have plenty of Optane ZIL
options zfs zfs_txg_timeout=5

# tune prefetch (have played with this 1000x different ways, no major improvement except max_streams to 2048, which helped, I think)
options zfs zfs_prefetch_disable=0
options zfs zfetch_max_distance=134217728
options zfs zfetch_max_streams=2048
options zfs zfetch_min_sec_reap=3
options zfs zfs_arc_min_prefetch_ms=250
options zfs zfs_arc_min_prescient_prefetch_ms=250
options zfs zfetch_array_rd_sz=16777216

# tune coalescing (same-ish, increasing the read gap limit helped throughput in conjunction with low async read max_active, as it caused much bigger reads to be sent to the backing devices)
options zfs zfs_vdev_aggregation_limit=16777216
options zfs zfs_vdev_read_gap_limit=1048576
options zfs zfs_vdev_write_gap_limit=262144

# ZIO scheduler in priority order 
options zfs zfs_vdev_sync_read_min_active=1
options zfs zfs_vdev_sync_read_max_active=10
options zfs zfs_vdev_sync_write_min_active=1
options zfs zfs_vdev_sync_write_max_active=10
options zfs zfs_vdev_async_read_min_active=1
options zfs zfs_vdev_async_read_max_active=2
options zfs zfs_vdev_async_write_min_active=1
options zfs zfs_vdev_async_write_max_active=4

# zvol threads
options zfs zvol_threads=32
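
The byte-valued settings above are hard to read at a glance, so a small script can decode them before they are loaded. The following is a minimal sketch, not part of the original post: the function names (`parse_modprobe_options`, `humanize`) and the `BYTE_PARAMS` set are made up for illustration, and the parser assumes one `key=value` pair per `options` line, as in the file above.

```python
import re

# Parameters from the config above whose values are byte counts.
# Assumption: hand-picked for this example, not an exhaustive list of tunables.
BYTE_PARAMS = {
    "zfs_arc_min", "zfs_arc_max", "zfs_dirty_data_max",
    "zfetch_max_distance", "zfetch_array_rd_sz",
    "zfs_vdev_aggregation_limit", "zfs_vdev_read_gap_limit",
    "zfs_vdev_write_gap_limit",
}

# Matches lines such as: options zfs zfs_arc_max=111669149696
OPTION_RE = re.compile(r"^options\s+(\S+)\s+(\w+)=(\d+)\s*$")


def parse_modprobe_options(text):
    """Return {parameter: int value} for each single-pair 'options' line."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = OPTION_RE.match(line)
        if match:
            _module, key, value = match.groups()
            params[key] = int(value)
    return params


def humanize(num_bytes):
    """Format a byte count with binary units (KiB, MiB, GiB, TiB)."""
    value = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if value < 1024 or unit == "TiB":
            return f"{value:g} {unit}"
        value /= 1024


def describe(text):
    """Yield one readable line per parameter, decoding byte-valued ones."""
    for key, value in parse_modprobe_options(text).items():
        shown = humanize(value) if key in BYTE_PARAMS else str(value)
        yield f"{key} = {shown}"


if __name__ == "__main__":
    with open("/etc/modprobe.d/zfs.conf") as handle:
        for line in describe(handle.read()):
            print(line)
```

Run against the configuration above, `zfs_arc_min` and `zfs_arc_max` decode to 52 GiB and 104 GiB, which matches the comment in the file, and `zfs_vdev_aggregation_limit` decodes to 16 MiB.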

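I'm pulling my hair out over this. Users are pressuring me to go all-Windows with Storage Spaces, but I've used parity Storage Spaces (even Storage Spaces Direct with mirrors on top), and it isn't pretty either. I'm tempted to go with plain mdadm raid60 under iSCSI, but I'd love it if someone could point out something boneheaded I'm missing that would unlock performance while keeping ZFS's bitrot protection :)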

Answer 1

eww*_*ite 7

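Good question.

I think your sparse zvol block size should be 128k.
Your ZIO scheduler settings should all be higher, e.g. a minimum of 10 and a maximum of 64.
zfs_txg_timeout should be much longer. I use 15 or 30 seconds on my systems.
I think multiple RAIDZ3s (or was that a typo?) are overkill and play a big part in the performance. Can you benchmark with RAIDZ2?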
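
To make the block-size suggestion concrete, it can help to estimate how much of one zvol block lands on each disk of a raidz vdev. The sketch below follows the raidz allocation arithmetic as I understand it from OpenZFS (`vdev_raidz_asize`): data sectors are spread across `width - nparity` columns, each row gets `nparity` parity sectors, and the allocation is padded to a multiple of `nparity + 1` sectors. The function name and the returned dictionary are made up for illustration, and compression is ignored, so with lz4 the real per-disk reads can be smaller still.

```python
import math


def raidz_block_layout(block_bytes, width, nparity, ashift=12):
    """Estimate how one logical block is laid out on a raidz vdev.

    Assumptions: the block is uncompressed and the vdev is healthy.
    """
    sector = 1 << ashift                      # 4096 bytes for ashift=12
    data_cols = width - nparity               # disks holding data in each row
    data_sectors = math.ceil(block_bytes / sector)
    rows = math.ceil(data_sectors / data_cols)
    parity_sectors = rows * nparity
    total = data_sectors + parity_sectors
    # Allocations are padded up to a multiple of (nparity + 1) sectors.
    padded = math.ceil(total / (nparity + 1)) * (nparity + 1)
    return {
        "data_sectors": data_sectors,
        "max_bytes_per_disk": rows * sector,  # largest contiguous piece per disk
        "allocated_bytes": padded * sector,
        "efficiency": block_bytes / (padded * sector),
    }


if __name__ == "__main__":
    # The question mentions both 9-wide and 7-wide raidz2, so show both.
    for width in (9, 7):
        for block in (64 * 1024, 128 * 1024):
            layout = raidz_block_layout(block, width, nparity=2)
            print(
                f"{width}-wide raidz2, {block // 1024}K block: "
                f"<= {layout['max_bytes_per_disk'] // 1024}K per disk, "
                f"{layout['allocated_bytes'] // 1024}K allocated, "
                f"{layout['efficiency']:.0%} space efficiency"
            )
```

Under these assumptions a 64K block puts at most 12K on each disk of a 9-wide raidz2 and 16K on each disk of a 7-wide raidz2, which is in the same range as the ~16K requests described in the question. A 128K block raises that to 20K and 28K respectively, so each disk does more work per seek.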

Edit: Install Netdata on the system and monitor utilization and ZFS stats.
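
Alongside a dashboard, the raw ARC counters can also be read directly. This is a minimal sketch assuming ZFS on Linux exposes them at `/proc/spl/kstat/zfs/arcstats` in the usual kstat text layout (two header lines, then `name type data` rows) and that the counters `hits`, `misses`, `prefetch_data_hits` and `prefetch_data_misses` are present. The helper names are made up for illustration.

```python
def parse_kstat(text):
    """Parse kstat text into {name: int}, skipping the two header lines."""
    stats = {}
    for line in text.splitlines()[2:]:
        parts = line.split()
        # Expect exactly: <name> <type> <integer value>
        if len(parts) == 3 and parts[2].lstrip("-").isdigit():
            stats[parts[0]] = int(parts[2])
    return stats


def hit_ratio(stats, hits_key="hits", misses_key="misses"):
    """Return hits / (hits + misses), or None if there were no lookups."""
    hits = stats.get(hits_key, 0)
    misses = stats.get(misses_key, 0)
    total = hits + misses
    return hits / total if total else None


def read_arcstats(path="/proc/spl/kstat/zfs/arcstats"):
    """Read and parse the ARC kstat file (path is an assumption, see above)."""
    with open(path) as handle:
        return parse_kstat(handle.read())


if __name__ == "__main__":
    stats = read_arcstats()
    print("overall ARC hit ratio:", hit_ratio(stats))
    print(
        "prefetch data hit ratio:",
        hit_ratio(stats, "prefetch_data_hits", "prefetch_data_misses"),
    )
```

The counters are cumulative since module load, so sampling them twice during a sequential read test and comparing the differences gives a better picture than a single reading.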

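Edit2: This is for a Veeam repository. Veeam supports Linux as a target and works wonderfully with ZFS. Would you consider benchmarking that with your data? zvols aren't an ideal use case for what you're doing, unless the NIC's offload is a critical part of the solution.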

Archived:	6 years, 10 months ago
Viewed:	1257 times
Last activity:	5 years, 3 months ago