Joe*_*ish 14 sql-server sql-server-2017
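Part of my workload uses a CLR function that implements the SpookyHash algorithm to compare rows and see whether any column values have changed. The CLR function takes a binary string as input, so I need a fast way to convert rows to a binary string. I expect to hash around 10 billion rows over the full workload, so I want this code to be as fast as possible.
I have about 300 tables with different schemas. For the purposes of this question, please assume a simple table structure of 32 nullable INT columns. I have provided sample data, along with a way to benchmark results, at the bottom of this question.
Rows must be converted to the same binary string if all column values are the same. Rows must be converted to different binary strings if any column value differs. For example, code as simple as the following will not work: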
CAST(COL1 AS BINARY(4)) + CAST(COL2 AS BINARY(4)) + ..
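It does not handle NULLs correctly. If COL1 is NULL for row 1 and COL2 is NULL for row 2, then both rows are converted to a NULL string. I believe that correct handling of NULLs is the hardest part of converting the entire row correctly. All allowed values for the INT columns are possible.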
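The collapse described above can be reproduced outside SQL Server. The sketch below is a small Python model, not T-SQL: `None` stands in for NULL, and the helper names `cast_binary4` and `naive_concat` are made up for illustration. It assumes that `+` on binary values returns NULL as soon as one operand is NULL.

```python
import struct

def cast_binary4(value):
    """Model CAST(col AS BINARY(4)): 4 big-endian bytes, or None for NULL."""
    if value is None:
        return None
    return struct.pack(">i", value)

def naive_concat(row):
    """Model b1 + b2 + ...: any NULL operand makes the whole result NULL."""
    parts = [cast_binary4(v) for v in row]
    if any(p is None for p in parts):
        return None
    return b"".join(parts)

row1 = (None, 123)   # COL1 is NULL
row2 = (123, None)   # COL2 is NULL
# Two different rows end up with the same (NULL) result.
print(naive_concat(row1), naive_concat(row2))
```

Both rows print as `None`, so a hash over the result could never tell them apart.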
To preempt a few questions:
What is the fastest way to convert 32 nullable INT columns into a BINARY(X) or VARBINARY(X) string?
The sample data and code, as promised:
-- create sample data
DROP TABLE IF EXISTS dbo.TABLE_OF_32_INTS;
CREATE TABLE dbo.TABLE_OF_32_INTS (
COL1 INT NULL,
COL2 INT NULL,
COL3 INT NULL,
COL4 INT NULL,
COL5 INT NULL,
COL6 INT NULL,
COL7 INT NULL,
COL8 INT NULL,
COL9 INT NULL,
COL10 INT NULL,
COL11 INT NULL,
COL12 INT NULL,
COL13 INT NULL,
COL14 INT NULL,
COL15 INT NULL,
COL16 INT NULL,
COL17 INT NULL,
COL18 INT NULL,
COL19 INT NULL,
COL20 INT NULL,
COL21 INT NULL,
COL22 INT NULL,
COL23 INT NULL,
COL24 INT NULL,
COL25 INT NULL,
COL26 INT NULL,
COL27 INT NULL,
COL28 INT NULL,
COL29 INT NULL,
COL30 INT NULL,
COL31 INT NULL,
COL32 INT NULL
);
INSERT INTO dbo.TABLE_OF_32_INTS WITH (TABLOCK)
SELECT 0, 123, 12345, 1234567, 123456789
, 0, 123, 12345, 1234567, 123456789
, 0, 123, 12345, 1234567, 123456789
, 0, 123, 12345, 1234567, 123456789
, 0, 123, 12345, 1234567, 123456789
, 0, 123, 12345, 1234567, 123456789
, NULL, -876545321
FROM
(
SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
) q
OPTION (MAXDOP 1);
GO
-- procedure to test performance
CREATE OR ALTER PROCEDURE #p AS
BEGIN
SET NOCOUNT ON;
DECLARE
@counter INT = 0,
@dummy VARBINARY(8000);
WHILE @counter < 10
BEGIN
SELECT @dummy = -- this code is clearly incomplete as it does not handle NULLs
CAST(COL1 AS BINARY(4)) +
CAST(COL2 AS BINARY(4)) +
CAST(COL3 AS BINARY(4)) +
CAST(COL4 AS BINARY(4)) +
CAST(COL5 AS BINARY(4)) +
CAST(COL6 AS BINARY(4)) +
CAST(COL7 AS BINARY(4)) +
CAST(COL8 AS BINARY(4)) +
CAST(COL9 AS BINARY(4)) +
CAST(COL10 AS BINARY(4)) +
CAST(COL11 AS BINARY(4)) +
CAST(COL12 AS BINARY(4)) +
CAST(COL13 AS BINARY(4)) +
CAST(COL14 AS BINARY(4)) +
CAST(COL15 AS BINARY(4)) +
CAST(COL16 AS BINARY(4)) +
CAST(COL17 AS BINARY(4)) +
CAST(COL18 AS BINARY(4)) +
CAST(COL19 AS BINARY(4)) +
CAST(COL20 AS BINARY(4)) +
CAST(COL21 AS BINARY(4)) +
CAST(COL22 AS BINARY(4)) +
CAST(COL23 AS BINARY(4)) +
CAST(COL24 AS BINARY(4)) +
CAST(COL25 AS BINARY(4)) +
CAST(COL26 AS BINARY(4)) +
CAST(COL27 AS BINARY(4)) +
CAST(COL28 AS BINARY(4)) +
CAST(COL29 AS BINARY(4)) +
CAST(COL30 AS BINARY(4)) +
CAST(COL31 AS BINARY(4)) +
CAST(COL32 AS BINARY(4))
FROM dbo.TABLE_OF_32_INTS
OPTION (MAXDOP 1);
SET @counter = @counter + 1;
END;
SELECT cpu_time
FROM sys.dm_exec_requests
WHERE session_id = @@SPID;
END;
GO
-- run procedure
EXEC #p;
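(I will still use SpookyHash on this binary result. The workload uses hash joins, and the hash value is used for one of the hash builds. I do not want a long binary value in the hash build because it requires too much memory.)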
Pau*_*ite 11
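On my machine (SQL Server 2017), the following C# SQLCLR function runs about 30% faster than the binary(5) idea, 35% faster than CONCAT_WS, and in about half the time of the self-answer.
It requires UNSAFE permission and uses pointers. The implementation is very specific to the test data.
For testing purposes, the easiest way to get the unsafe assembly working is to set the database to TRUSTWORTHY and, if necessary, disable the clr strict security configuration option.
For convenience, the compiled bits for CREATE ASSEMBLY are at https://gist.github.com/SQLKiwi/72d01b661c74485900e7ebcfdc63ab8e
T-SQL function stub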
CREATE FUNCTION dbo.NullableIntsToBinary
(
@Col01 int, @Col02 int, @Col03 int, @Col04 int, @Col05 int, @Col06 int, @Col07 int, @Col08 int,
@Col09 int, @Col10 int, @Col11 int, @Col12 int, @Col13 int, @Col14 int, @Col15 int, @Col16 int,
@Col17 int, @Col18 int, @Col19 int, @Col20 int, @Col21 int, @Col22 int, @Col23 int, @Col24 int,
@Col25 int, @Col26 int, @Col27 int, @Col28 int, @Col29 int, @Col30 int, @Col31 int, @Col32 int
)
RETURNS binary(132)
WITH EXECUTE AS CALLER
AS EXTERNAL NAME Obbish.UserDefinedFunctions.NullableIntsToBinary;
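The C# source code is at https://gist.github.com/SQLKiwi/64f320fe7fd802a68a3a644aa8b8af9f
If you compile it yourself, you must use Class Library (.dll) as the target project type and check the "Allow unsafe code" build option.
Since you ultimately want to compute the SpookyHash of the binary data returned above, you can call SpookyHash inside the CLR function and return the 16-byte hash.
A sample implementation based on a table with mixed column data types is at https://gist.github.com/SQLKiwi/6f82582a4ad1920c372fac118ec82460. It includes an unsafe inlined version of the Spooky Hash algorithm, derived from Jon Hanna's SpookilySharp and Bob Jenkins' original public domain C source code.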
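To illustrate the idea of serializing and hashing in one function so that only 16 bytes leave it, here is a rough Python sketch. It is not the linked C# code: SpookyHash is not in the Python standard library, so BLAKE2b with a 16-byte digest is used as a stand-in, and the row layout (one flag byte plus four value bytes per column) and the name `row_hash16` are assumptions made for this example.

```python
import hashlib
import struct

def row_hash16(row):
    """Serialize nullable ints unambiguously, then return a 16-byte digest."""
    buf = bytearray()
    for v in row:
        if v is None:
            # Flag byte 0x01 marks NULL; no real INT produces this prefix.
            buf += b"\x01\x00\x00\x00\x00"
        else:
            buf += b"\x00" + struct.pack(">i", v)
    # BLAKE2b is only a stand-in; the answer's function uses SpookyHash.
    return hashlib.blake2b(bytes(buf), digest_size=16).digest()

print(len(row_hash16((0, 123, None, -876545321))))  # 16
```

The caller then joins on a fixed 16-byte value instead of the full serialized row, which is the memory saving the question asks for.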
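An INT column has four bytes of allowed values, which exactly matches the size of a BINARY(4). In other words, every possible value of a BINARY(4) corresponds to a possible value of an INT column. So there is no safe NULL replacement unless some values are not allowed in the INT column. Whether the column is NULL must be encoded separately. It simply cannot fit into the BINARY(4).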
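The counting argument can be checked with a few lines of Python. This is only a model of the conversion (big-endian, two's complement, which is how an INT appears when cast to BINARY(4)); the helper names are invented for the example. Whatever four-byte pattern one might pick as a NULL marker decodes to a legal INT and round-trips back to the same bytes.

```python
import struct

def int_to_binary4(value):
    """Model CAST(int AS BINARY(4))."""
    return struct.pack(">i", value)

def binary4_to_int(raw):
    """Inverse of int_to_binary4."""
    return struct.unpack(">i", raw)[0]

# Candidate "NULL markers": each one is already taken by a real INT value.
for candidate in (b"\x00\x00\x00\x00", b"\xff\xff\xff\xff",
                  b"\x80\x00\x00\x00", b"\x7f\xff\xff\xff"):
    value = binary4_to_int(candidate)
    assert -2**31 <= value <= 2**31 - 1
    assert int_to_binary4(value) == candidate
    print(candidate.hex(), "->", value)
```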
One approach is to use a NULL bitmap. Consider the following code:
CAST(
CASE WHEN COL1 IS NOT NULL THEN 0 ELSE 1 END |
CASE WHEN COL2 IS NOT NULL THEN 0 ELSE 2 END |
CASE WHEN COL3 IS NOT NULL THEN 0 ELSE 4 END |
CASE WHEN COL4 IS NOT NULL THEN 0 ELSE 8 END |
CASE WHEN COL5 IS NOT NULL THEN 0 ELSE 16 END |
CASE WHEN COL6 IS NOT NULL THEN 0 ELSE 32 END |
CASE WHEN COL7 IS NOT NULL THEN 0 ELSE 64 END |
CASE WHEN COL8 IS NOT NULL THEN 0 ELSE 128 END
AS BINARY(1))
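Whether each of eight columns is NULL fits into a single byte. These expressions can be compared between rows to check that the same columns are NULL or NOT NULL. With that additional information, it becomes safe to replace NULL column values with any non-NULL value. I found CAST(ISNULL(COL1, 0) AS BINARY(4)) to be the fastest, although other variations such as ISNULL(CAST(COL1 AS VARBINARY(4)), 0x) are also possible.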
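As a cross-check of the bit layout, the same bitmap byte can be computed in Python. This is a model of the CASE/bitwise-OR expression, with `None` standing in for NULL and `null_bitmap_byte` being a name made up for the example: bit 0 belongs to the first column and bit 7 to the eighth.

```python
def null_bitmap_byte(columns):
    """Return one byte whose bit i is set when columns[i] is NULL (None)."""
    if len(columns) > 8:
        raise ValueError("one byte covers at most eight columns")
    bits = 0
    for i, value in enumerate(columns):
        if value is None:
            bits |= 1 << i  # 1, 2, 4, ..., 128, as in the CASE expressions
    return bytes([bits])

# Columns 1, 3 and 8 are NULL: 1 + 4 + 128 = 133 = 0x85
print(null_bitmap_byte([None, 5, None, 7, 0, 0, 0, None]).hex())  # 85
```

Because the largest possible value is 255, the result always fits in the single byte that BINARY(1) keeps.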
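It is hard to prove anything definitively, but I found the following details to be fastest:
On my machine the benchmark takes about 27.5 CPU seconds. Unfortunately, the NULL bitmap step takes about a third of that time. It would be nice if there were a faster way to do it.
Here is the full solution: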
SELECT
CAST(ISNULL(COL1, 0) AS BINARY(4)) +
CAST(ISNULL(COL2, 0) AS BINARY(4)) +
CAST(ISNULL(COL3, 0) AS BINARY(4)) +
CAST(ISNULL(COL4, 0) AS BINARY(4)) +
CAST(ISNULL(COL5, 0) AS BINARY(4)) +
CAST(ISNULL(COL6, 0) AS BINARY(4)) +
CAST(ISNULL(COL7, 0) AS BINARY(4)) +
CAST(ISNULL(COL8, 0) AS BINARY(4)) +
CAST(ISNULL(COL9, 0) AS BINARY(4)) +
CAST(ISNULL(COL10, 0) AS BINARY(4)) +
CAST(ISNULL(COL11, 0) AS BINARY(4)) +
CAST(ISNULL(COL12, 0) AS BINARY(4)) +
CAST(ISNULL(COL13, 0) AS BINARY(4)) +
CAST(ISNULL(COL14, 0) AS BINARY(4)) +
CAST(ISNULL(COL15, 0) AS BINARY(4)) +
CAST(ISNULL(COL16, 0) AS BINARY(4)) +
CAST(ISNULL(COL17, 0) AS BINARY(4)) +
CAST(ISNULL(COL18, 0) AS BINARY(4)) +
CAST(ISNULL(COL19, 0) AS BINARY(4)) +
CAST(ISNULL(COL20, 0) AS BINARY(4)) +
CAST(ISNULL(COL21, 0) AS BINARY(4)) +
CAST(ISNULL(COL22, 0) AS BINARY(4)) +
CAST(ISNULL(COL23, 0) AS BINARY(4)) +
CAST(ISNULL(COL24, 0) AS BINARY(4)) +
CAST(ISNULL(COL25, 0) AS BINARY(4)) +
CAST(ISNULL(COL26, 0) AS BINARY(4)) +
CAST(ISNULL(COL27, 0) AS BINARY(4)) +
CAST(ISNULL(COL28, 0) AS BINARY(4)) +
CAST(ISNULL(COL29, 0) AS BINARY(4)) +
CAST(ISNULL(COL30, 0) AS BINARY(4)) +
CAST(ISNULL(COL31, 0) AS BINARY(4)) +
CAST(ISNULL(COL32, 0) AS BINARY(4)) +
CAST(
CASE WHEN COL1 IS NOT NULL THEN 0 ELSE 1 END |
CASE WHEN COL2 IS NOT NULL THEN 0 ELSE 2 END |
CASE WHEN COL3 IS NOT NULL THEN 0 ELSE 4 END |
CASE WHEN COL4 IS NOT NULL THEN 0 ELSE 8 END |
CASE WHEN COL5 IS NOT NULL THEN 0 ELSE 16 END |
CASE WHEN COL6 IS NOT NULL THEN 0 ELSE 32 END |
CASE WHEN COL7 IS NOT NULL THEN 0 ELSE 64 END |
CASE WHEN COL8 IS NOT NULL THEN 0 ELSE 128 END
AS BINARY(1)) +
CAST(
CASE WHEN COL9 IS NOT NULL THEN 0 ELSE 1 END |
CASE WHEN COL10 IS NOT NULL THEN 0 ELSE 2 END |
CASE WHEN COL11 IS NOT NULL THEN 0 ELSE 4 END |
CASE WHEN COL12 IS NOT NULL THEN 0 ELSE 8 END |
CASE WHEN COL13 IS NOT NULL THEN 0 ELSE 16 END |
CASE WHEN COL14 IS NOT NULL THEN 0 ELSE 32 END |
CASE WHEN COL15 IS NOT NULL THEN 0 ELSE 64 END |
CASE WHEN COL16 IS NOT NULL THEN 0 ELSE 128 END
AS BINARY(1)) +
CAST(
CASE WHEN COL17 IS NOT NULL THEN 0 ELSE 1 END |
CASE WHEN COL18 IS NOT NULL THEN 0 ELSE 2 END |
CASE WHEN COL19 IS NOT NULL THEN 0 ELSE 4 END |
CASE WHEN COL20 IS NOT NULL THEN 0 ELSE 8 END |
CASE WHEN COL21 IS NOT NULL THEN 0 ELSE 16 END |
CASE WHEN COL22 IS NOT NULL THEN 0 ELSE 32 END |
CASE WHEN COL23 IS NOT NULL THEN 0 ELSE 64 END |
CASE WHEN COL24 IS NOT NULL THEN 0 ELSE 128 END
AS BINARY(1)) +
CAST(
CASE WHEN COL25 IS NOT NULL THEN 0 ELSE 1 END |
CASE WHEN COL26 IS NOT NULL THEN 0 ELSE 2 END |
CASE WHEN COL27 IS NOT NULL THEN 0 ELSE 4 END |
CASE WHEN COL28 IS NOT NULL THEN 0 ELSE 8 END |
CASE WHEN COL29 IS NOT NULL THEN 0 ELSE 16 END |
CASE WHEN COL30 IS NOT NULL THEN 0 ELSE 32 END |
CASE WHEN COL31 IS NOT NULL THEN 0 ELSE 64 END |
CASE WHEN COL32 IS NOT NULL THEN 0 ELSE 128 END
AS BINARY(1))
FROM dbo.TABLE_OF_32_INTS
OPTION (MAXDOP 1);
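To convince yourself that this layout loses no information, it helps to write the inverse. The Python sketch below models the query above (values first with NULL replaced by 0, then one bitmap byte per eight columns) and adds a hypothetical `decode_row` that recovers the original row. The function names and the use of `None` for NULL are assumptions of the sketch, not part of the T-SQL.

```python
import struct

def encode_row(row):
    """Values (NULL -> 0) as 4 big-endian bytes each, then the NULL bitmap bytes."""
    if len(row) % 8 != 0:
        raise ValueError("this sketch assumes a multiple of eight columns")
    values = b"".join(struct.pack(">i", 0 if v is None else v) for v in row)
    bitmap = bytearray(len(row) // 8)
    for i, v in enumerate(row):
        if v is None:
            bitmap[i // 8] |= 1 << (i % 8)
    return values + bytes(bitmap)

def decode_row(raw, column_count):
    """Rebuild the row; a set bitmap bit overrides the placeholder 0."""
    values = struct.unpack(">%di" % column_count, raw[:4 * column_count])
    bitmap = raw[4 * column_count:]
    return tuple(
        None if bitmap[i // 8] & (1 << (i % 8)) else values[i]
        for i in range(column_count)
    )

# Same shape as the sample data: 30 repeating values, then NULL and a negative INT.
sample = (0, 123, 12345, 1234567, 123456789) * 6 + (None, -876545321)
encoded = encode_row(sample)
print(len(encoded))  # 132
assert decode_row(encoded, 32) == sample
# A real 0 and a NULL differ only in the bitmap, but they do differ.
assert encode_row((0,) * 32) != encode_row((None,) + (0,) * 31)
```

Since every row can be decoded back, two different rows can never share an encoding.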
How about using BINARY(5) and converting NULL to a value that is out of range for an INT:
SELECT @dummy =
ISNULL(CAST(COL1 AS BINARY(5)), 0x0100000000) +
ISNULL(CAST(COL2 AS BINARY(5)), 0x0100000000) +
ISNULL(CAST(COL3 AS BINARY(5)), 0x0100000000) +
ISNULL(CAST(COL4 AS BINARY(5)), 0x0100000000) +
ISNULL(CAST(COL5 AS BINARY(5)), 0x0100000000) +
ISNULL(CAST(COL6 AS BINARY(5)), 0x0100000000) +
ISNULL(CAST(COL7 AS BINARY(5)), 0x0100000000) +
ISNULL(CAST(COL8 AS BINARY(5)), 0x0100000000) +
ISNULL(CAST(COL9 AS BINARY(5)), 0x0100000000) +
ISNULL(CAST(COL10 AS BINARY(5)), 0x0100000000) +
ISNULL(CAST(COL11 AS BINARY(5)), 0x0100000000) +
ISNULL(CAST(COL12 AS BINARY(5)), 0x0100000000) +
ISNULL(CAST(COL13 AS BINARY(5)), 0x0100000000) +
ISNULL(CAST(COL14 AS BINARY(5)), 0x0100000000) +
ISNULL(CAST(COL15 AS BINARY(5)), 0x0100000000) +
ISNULL(CAST(COL16 AS BINARY(5)), 0x0100000000) +
ISNULL(CAST(COL17 AS BINARY(5)), 0x0100000000) +
ISNULL(CAST(COL18 AS BINARY(5)), 0x0100000000) +
ISNULL(CAST(COL19 AS BINARY(5)), 0x0100000000) +
ISNULL(CAST(COL20 AS BINARY(5)), 0x0100000000) +
ISNULL(CAST(COL21 AS BINARY(5)), 0x0100000000) +
ISNULL(CAST(COL22 AS BINARY(5)), 0x0100000000) +
ISNULL(CAST(COL23 AS BINARY(5)), 0x0100000000) +
ISNULL(CAST(COL24 AS BINARY(5)), 0x0100000000) +
ISNULL(CAST(COL25 AS BINARY(5)), 0x0100000000) +
ISNULL(CAST(COL26 AS BINARY(5)), 0x0100000000) +
ISNULL(CAST(COL27 AS BINARY(5)), 0x0100000000) +
ISNULL(CAST(COL28 AS BINARY(5)), 0x0100000000) +
ISNULL(CAST(COL29 AS BINARY(5)), 0x0100000000) +
ISNULL(CAST(COL30 AS BINARY(5)), 0x0100000000) +
ISNULL(CAST(COL31 AS BINARY(5)), 0x0100000000) +
ISNULL(CAST(COL32 AS BINARY(5)), 0x0100000000)
FROM dbo.TABLE_OF_32_INTS
OPTION (MAXDOP 1);
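The reason the marker is safe can be sketched in Python. The model assumes that an INT cast to BINARY(5) is left-padded with a 0x00 byte, which is what this approach relies on, so no real value can start with 0x01. `cast_binary5` and `encode_row` are illustrative names only.

```python
import struct

NULL_MARKER = bytes.fromhex("0100000000")  # same literal as in the T-SQL above

def cast_binary5(value):
    """Model ISNULL(CAST(col AS BINARY(5)), 0x0100000000)."""
    if value is None:
        return NULL_MARKER
    # Assumption: widening an INT to five bytes pads on the left with 0x00.
    return b"\x00" + struct.pack(">i", value)

def encode_row(row):
    """Concatenate the fixed-width five-byte fields."""
    return b"".join(cast_binary5(v) for v in row)

encoded = encode_row((1, None, -1))
print(encoded.hex())
# Every non-NULL field starts with 00, every NULL field with 01.
assert all(encoded[i] in (0, 1) for i in range(0, len(encoded), 5))
```

The cost is one extra byte per column, so 32 columns produce 160 bytes instead of the 132 bytes of the bitmap layout.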
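In my tests, concat_ws was a bit faster (18 seconds) than your NULL bitmap solution (26 seconds). There will be more data to shuffle around, so you might see a performance penalty elsewhere, and if you want to mix this with character columns you will have to choose the separator wisely.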
select @dummy = cast(concat_ws('|',
isnull(cast(T.COL1 as varchar(11)), ''),
isnull(cast(T.COL2 as varchar(11)), ''),
isnull(cast(T.COL3 as varchar(11)), ''),
isnull(cast(T.COL4 as varchar(11)), ''),
isnull(cast(T.COL5 as varchar(11)), ''),
isnull(cast(T.COL6 as varchar(11)), ''),
isnull(cast(T.COL7 as varchar(11)), ''),
isnull(cast(T.COL8 as varchar(11)), ''),
isnull(cast(T.COL9 as varchar(11)), ''),
isnull(cast(T.COL10 as varchar(11)), ''),
isnull(cast(T.COL11 as varchar(11)), ''),
isnull(cast(T.COL12 as varchar(11)), ''),
isnull(cast(T.COL13 as varchar(11)), ''),
isnull(cast(T.COL14 as varchar(11)), ''),
isnull(cast(T.COL15 as varchar(11)), ''),
isnull(cast(T.COL16 as varchar(11)), ''),
isnull(cast(T.COL17 as varchar(11)), ''),
isnull(cast(T.COL18 as varchar(11)), ''),
isnull(cast(T.COL19 as varchar(11)), ''),
isnull(cast(T.COL20 as varchar(11)), ''),
isnull(cast(T.COL21 as varchar(11)), ''),
isnull(cast(T.COL22 as varchar(11)), ''),
isnull(cast(T.COL23 as varchar(11)), ''),
isnull(cast(T.COL24 as varchar(11)), ''),
isnull(cast(T.COL25 as varchar(11)), ''),
isnull(cast(T.COL26 as varchar(11)), ''),
isnull(cast(T.COL27 as varchar(11)), ''),
isnull(cast(T.COL28 as varchar(11)), ''),
isnull(cast(T.COL29 as varchar(11)), ''),
isnull(cast(T.COL30 as varchar(11)), ''),
isnull(cast(T.COL31 as varchar(11)), ''),
isnull(cast(T.COL32 as varchar(11)), ''))
as varbinary(8000))
from dbo.TABLE_OF_32_INTS as T
option (maxdop 1)
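The separator caveat is easier to see with a small model. CONCAT_WS skips NULL arguments, which is why every column above is wrapped in isnull(..., ''); an empty string keeps its position between separators. The Python sketch below mimics that behaviour with a hypothetical `concat_ws_row` helper (`None` stands in for NULL) and then shows the ambiguity that appears once character data may contain the separator.

```python
def concat_ws_row(row, sep="|"):
    """Model the cast(concat_ws(sep, isnull(cast(col as varchar), ''), ...) as varbinary)."""
    return sep.join("" if v is None else str(v) for v in row).encode("ascii")

# No integer renders as an empty string, so NULL stays distinguishable.
print(concat_ws_row((0, None, -876545321)))  # b'0||-876545321'

# With character columns, a value containing the separator makes rows ambiguous.
assert concat_ws_row(("a|b", "c")) == concat_ws_row(("a", "b|c"))
```

For integer-only tables any separator that is not a digit or a minus sign works; for character columns the separator must be a value that cannot occur in the data.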