Multiple PVSCSI with SQL Server

Jos*_*ira 12 sql-server vmware virtualisation

Regarding SQL Server virtualization, I've been trying to find out whether separating data devices from log devices onto different paravirtual SCSI (PVSCSI) adapters has a positive performance impact, similar to what is done here.

There was a scenario at a client where an additional PVSCSI was added and the log devices were moved onto the new PVSCSI, showing a considerable performance gain. Still, I'm not sure whether the improvement came from that separation or simply from the fact that an additional PVSCSI was now present.

It's well known that log disks are usually written sequentially, while data disks follow a more random pattern in their r/w, and that there are performance benefits to placing these two different kinds of files on separate disks.

But what about the controllers? Is there also a benefit to keeping these different patterns on separate PVSCSI controllers?

Does anyone have any insight on this?

Thanks in advance.

sqL*_*dLe 16

I'll answer in two parts: first, why the traditional answer about separating sequential and random often doesn't apply.

Then I'll discuss the potential benefits of separating files across Windows physicaldisks, and of adding additional vHBAs and distributing the physicaldisks among them.

Expecting a benefit from separating random and sequential disk IO at the Windows physicaldisk level typically assumes HDD devices for the data storage. It also typically assumes that separate Windows physicaldisks mean separate HDD devices. The idea is that one set of HDDs handles mainly sequential disk IO with very limited disk head movement (e.g. the HDDs hosting a single busy txlog*), while a separate set of HDDs handles the random disk IO.

Those assumptions rarely hold today - especially in a VM. First of all, unless the VM's Windows physicaldisks are RDMs, multiple of them may sit in a single datastore - or multiple datastores may sit on a single ESXi host LUN. So what is separated in the guest can be commingled at the ESXi host level.

But let's say RDMs are used, or that each guest physicaldisk is in its own datastore, on its own ESXi LUN. Even then, sequential and random IO that is separate in the guest is often commingled at the array, because the LUNs presented to the ESXi host may come from the same pool of disk devices. Almost every storage array does this now - either exclusively or as an option to simplify management and improve array efficiency/resource utilization.

Finally, so much storage today is either all-flash or hybrid flash + HDD. With no head movement to worry about, flash doesn't care about separating sequential from random… it doesn't even care about IO weaving.

So… those are the reasons separating sequential from random may not be all that beneficial. Next up: why spreading files across physicaldisks, and spreading physicaldisks across vHBAs, can still boost performance.

*I deliberately mentioned a single transaction log in this HDD example. When several separate sequential disk IO streams (e.g. 8 busy transaction logs) hit the same HDD - unless nearly all of the activity is absorbed by SAN cache - the constant head movement among the sequential IO tracks leads to IO weaving. That's a specific kind of disk head thrashing that produces disk latency that is "worse than random". It happens on RAID5 and RAID10, although RAID10 can tolerate a bit more variation in this regard than RAID5 before degrading significantly.


Now - given that longwinded talk about how separating sequential from random might not help - how can spreading files across physicaldisks still help? How can spreading physicaldisks among vHBAs help?

It's all about disk IO queues.

Any Windows physicaldisk or LogicalDisk can have up to 255 outstanding disk IOs at a time in what is reported by perfmon as "Current Disk Queue". From the outstanding disk IOs in the physicaldisk queue, storport can pass up to 254 to the minidriver. But the minidriver may also have both a service queue (passed down to the next lower level) and a wait queue. And storport can be told to lower the number it passes on from 254.

In a VMware Windows guest, the pvscsi driver has a default "device" queue depth of 64, where the device is a physicaldisk. So although perfmon could show up to 255 disk IOs in "current disk queue length" for a single physicaldisk, only up to 64 of them would be passed to the next level at a time (unless defaults are changed).
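
To make the layering concrete, here's a minimal Python sketch (not vendor code) of the limits just described - the 255 per-physicaldisk ceiling, the 254 storport will pass on, and the pvscsi per-device default of 64. The function name and structure are purely illustrative.

```python
# Queue limits from this answer; a sketch of how they stack, not driver code.
PHYSICALDISK_MAX_OUTSTANDING = 255   # perfmon "Current Disk Queue Length" ceiling
STORPORT_MAX_PASSED = 254            # most storport will hand to the minidriver
PVSCSI_DEFAULT_DEVICE_QDEPTH = 64    # pvscsi per-device default

def ios_passed_to_next_level(outstanding_ios, device_qdepth=PVSCSI_DEFAULT_DEVICE_QDEPTH):
    """How many of a physicaldisk's outstanding IOs make it past the guest
    device queue at once; the rest sit in a wait queue."""
    outstanding = min(outstanding_ios, PHYSICALDISK_MAX_OUTSTANDING)
    return min(outstanding, STORPORT_MAX_PASSED, device_qdepth)

print(ios_passed_to_next_level(200))       # 64 with defaults; the other 136 wait
print(ios_passed_to_next_level(200, 254))  # 200 if the device queue depth is raised
```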

How many disk IOs can be outstanding to one busy transaction log at a time? Well, transaction log writes can be up to 60kb in size. During a high scale ETL, I'll often see every write to the txlog at 60kb. The txlog writer can have up to 32 writes of 60kb outstanding to one txlog at a time. So what if I've got a busy staging txlog and a busy dw txlog on the same physicaldisk, with default VMware settings? If both txlogs are maxing out at 32 outstanding 60kb writes each, that physicaldisk is at its queue depth of 64. Now… what if there are also flatfiles as an ETL source on the physicaldisk? Well… between reads to the flatfiles and txlog writes, they'd have to use the wait queue, because only 64 can get out at a time. For databases with busy txlogs like that, whether physical server or virtual, I recommend the txlog on its own physicaldisk, with nothing else on the physicaldisk. That prevents queueing at that level and also eliminates any concern with contents of multiple files interleaving (which is a much, much lesser concern these days).
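
As a quick back-of-the-envelope illustration of that txlog math - using only the numbers above (32 outstanding writes of up to 60kb per txlog, default pvscsi device queue depth of 64) - a hypothetical sketch in Python:

```python
# Two busy txlogs on one physicaldisk with default pvscsi settings.
TXLOG_MAX_OUTSTANDING_WRITES = 32
TXLOG_MAX_WRITE_KB = 60
PVSCSI_DEFAULT_DEVICE_QDEPTH = 64

busy_txlogs_on_physicaldisk = 2   # e.g. a staging txlog and a dw txlog
txlog_oio = busy_txlogs_on_physicaldisk * TXLOG_MAX_OUTSTANDING_WRITES

print(f"txlog outstanding IO: {txlog_oio}")                  # 64 - queue depth hit
print(f"device queue slots left for flatfile reads etc.: "
      f"{max(PVSCSI_DEFAULT_DEVICE_QDEPTH - txlog_oio, 0)}") # 0 - they go to the wait queue
```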

How many disk IOs can be outstanding to a rowfile at a time (from SQL Server's perspective, not necessarily submitted to lower levels)? There isn't really a limit in SQL Server itself (that I've found, anyway). But assuming the file is on a single Windows physicaldisk (I do not recommend using striped dynamic disks for SQL Server, that's a topic for another time), there is a limit. It's the 255 I mentioned before.

With the magic of SQL Server readahead and asynchronous IO, I've seen 4 concurrent queries each running in serial drive a total "current disk queue length" of over 1200! Because of the 255 limit, that isn't even possible with all rowfile contents on a single physicaldisk. It was against a primary filegroup with 8 files, each on its own physicaldisk.
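
A tiny sketch of that arithmetic, assuming only the numbers above (1200 total outstanding IOs, 8 files each on its own physicaldisk, the 255 per-physicaldisk ceiling):

```python
# Why 1200 outstanding IOs needed multiple physicaldisks.
PHYSICALDISK_MAX_OUTSTANDING = 255

total_oio = 1200
files = 8  # primary filegroup files, each on its own physicaldisk

per_disk = total_oio / files
print(per_disk)                                   # 150 per physicaldisk on average
print(per_disk <= PHYSICALDISK_MAX_OUTSTANDING)   # True - fits per disk
print(total_oio <= PHYSICALDISK_MAX_OUTSTANDING)  # False - impossible on one physicaldisk
```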

So readahead reads can be very aggressive, and can stress IO queues. They can be so aggressive that other rowfile reads and writes end up waiting. If transaction logs are on the same physicaldisk as rowfiles, during simultaneous readahead reads and txlog writes it's very easy for waiting to take place. Even if that waiting isn't at the "current disk queue length" level, it may be waiting at the device queue (64 by default with pvscsi).

Backup reads against rowfiles can also be aggressive, especially if buffercount has been tuned in order to maximize backup throughput.

There's one more SQL Server io type to be aware of when considering isolating txlogs: query spill to tempdb. When query spill takes place, each spilling worker writes to tempdb. Got a lot of parallel workers all spilling at the same time? That can be quite a write load. Keeping a busy txlog and important rowfiles away from that can be really helpful :-)

Now, it is possible to change the default device queue depth for the pvscsi driver. It defaults to 64, and can be set as high as 254 which is the most storport will pass on. But be careful changing this. I always recommend aligning the guest device queue depth with the underlying ESXi host LUN queue depth. And setting ESXi host LUN queue depth per array best practices. Using an EMC VNX? Host LUN queue depth should be 32. Guest uses RDMs? Great. Set guest pvscsi device queue depth to 32 so it's aligned with the ESXi host LUN queue depth. EMC VMAX? Typically 64 at ESXi host level, 64 in guest. Pure/Xtremio/IBM FlashSystem? Sometimes host LUN queue depth will be set as high as 256! Go ahead and set pvscsi device queue depth to 254 (Max possible) then.
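
For illustration only, here's a hedged sketch of that alignment rule of thumb. The array names and depths are just the examples given in this answer - check the array vendor's current best practices rather than treating this mapping as authoritative.

```python
# Align guest pvscsi device queue depth to the ESXi host LUN queue depth,
# capped at the 254 that storport will pass on. Example values from this answer.
STORPORT_MAX_PASSED = 254

example_host_lun_qdepth = {
    "EMC VNX": 32,
    "EMC VMAX": 64,
    "all-flash (e.g. Pure/XtremIO/IBM FlashSystem)": 256,
}

def guest_pvscsi_device_qdepth(host_lun_qdepth):
    return min(host_lun_qdepth, STORPORT_MAX_PASSED)

for array, depth in example_host_lun_qdepth.items():
    print(f"{array}: host LUN qdepth {depth} -> guest pvscsi device qdepth "
          f"{guest_pvscsi_device_qdepth(depth)}")   # 32, 64, 254
```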

Here's a link with instructions. https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2053145

The link also talks about requestringpages - what are those? They determine queue depth for the pvscsi adapter itself. Each page gives 32 slots in adapter queue depth. By default, requestringpages is 8, for an adapter queue depth of 256. It can be set as high as 32, for 1024 adapter queue depth slots.
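
The ring-page arithmetic is simple enough to sketch; this assumes only what's stated above (32 slots per page, default of 8 pages, maximum of 32):

```python
# Adapter queue depth as a function of requestringpages.
SLOTS_PER_RING_PAGE = 32

def adapter_queue_depth(requestringpages):
    return requestringpages * SLOTS_PER_RING_PAGE

print(adapter_queue_depth(8))    # 256 - the default
print(adapter_queue_depth(16))   # 512
print(adapter_queue_depth(32))   # 1024 - the maximum
```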

Let's say everything is at default. I've got 8 physicaldisks with rowfiles on them, and SQL Server is lightly busy. There's an average of 32 "current disk queue length" across the 8, and none are higher than 64 (everything fits in the various device service queues). Great - that gives 256 outstanding IOs (OIO). It fits in the device service queues, and it fits in the adapter service queue, so all 256 make it out of the guest to queues at the ESXi host level.

But… say things get a little busier - an average of 64, with some physicaldisks' queues as high as 128. For the devices with more than 64 outstanding, the overage sits in a wait queue. And if more than 256 IOs are in the device service queues across the 8 physicaldisks, the overage there waits until slots in the adapter service queue open up.

In that case, adding another pvscsi vHBA and spreading the physicaldisks between them doubles the total adapter queue depth to 512. More io can be passed from guest to host at the same time.
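
Here's a simplified model of that accounting, assuming the physicaldisks are spread evenly enough between adapters that a single cap is a fair approximation; the scenario numbers are the ones from the previous paragraphs, not measurements.

```python
# How much IO can leave the guest with 1 vs. 2 pvscsi vHBAs at default settings.
PVSCSI_DEFAULT_DEVICE_QDEPTH = 64
DEFAULT_ADAPTER_QDEPTH = 256      # requestringpages 8 * 32 slots

def ios_leaving_guest(per_disk_oio, adapters,
                      device_qdepth=PVSCSI_DEFAULT_DEVICE_QDEPTH,
                      adapter_qdepth=DEFAULT_ADAPTER_QDEPTH):
    # per-device service queues first...
    in_device_queues = sum(min(oio, device_qdepth) for oio in per_disk_oio)
    # ...then the adapter service queue(s) cap what actually leaves the guest
    return min(in_device_queues, adapters * adapter_qdepth)

busier = [64, 128, 64, 128, 64, 64, 32, 64]   # the "little busier" scenario
print(ios_leaving_guest(busier, adapters=1))  # 256 - capped by one adapter's queue depth
print(ios_leaving_guest(busier, adapters=2))  # 480 - the adapter is no longer the bottleneck
```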

Something similar could be achieved by staying at one pvscsi adapter and increasing requestringpages. Going to 16 would yield 512 slots, and 32 yields 1024 slots.

When possible, I recommend going wide (adding adapters) before going deep (increasing adapter queue depth). But… on many of the busiest systems, gotta do both: put 4 vHBAs on the guest, and increase requestringpages to 32.

There are lots of other considerations, too. Things like sioc and adaptive queue depth throttling if vmdks are used, configuration of multipathing, configuration of the ESXi adapter beyond LUN queue depth, etc.

But I don't want to overstay my welcome :-)

Lonny Niederstadt @sqL_handLe