为具有有限协作距离的行分配唯一值的解决方案

bas*_*n22 9 sql-server sql-server-2016

我有一个可以使用以下代码创建和填充的表:

CREATE TABLE dbo.Example(GroupKey int NOT NULL, RecordKey varchar(12) NOT NULL);
ALTER TABLE dbo.Example
    ADD CONSTRAINT iExample PRIMARY KEY CLUSTERED(GroupKey ASC, RecordKey ASC);
INSERT INTO dbo.Example(GroupKey, RecordKey)
VALUES (1, 'Archimedes'), (1, 'Newton'), (1, 'Euler'), (2, 'Euler'), (2, 'Gauss'),
       (3, 'Gauss'), (3, 'Poincaré'), (4, 'Ramanujan'), (5, 'Neumann'),
       (5, 'Grothendieck'), (6, 'Grothendieck'), (6, 'Tao');
Run Code Online (Sandbox Code Playgroud)

对于基于与另一行的有限协作距离的所有行RecordKey,我想分配一个唯一值 - 我不关心唯一值是如何或什么数据类型。

可以使用以下查询生成符合我要求的正确结果集:

SELECT 1 AS SupergroupKey, GroupKey, RecordKey
FROM dbo.Example
WHERE GroupKey IN(1, 2, 3)
UNION ALL
SELECT 2 AS SupergroupKey, GroupKey, RecordKey
FROM dbo.Example
WHERE GroupKey = 4
UNION ALL
SELECT 3 AS SupergroupKey, GroupKey, RecordKey
FROM dbo.Example
WHERE GroupKey IN(5, 6)
ORDER BY SupergroupKey ASC, GroupKey ASC, RecordKey ASC;
Run Code Online (Sandbox Code Playgroud)

为了更好地帮助我提出问题,我将解释为什么GroupKeys 1-3 具有相同的SupergroupKey

  • GroupKey1 包含RecordKey欧拉,而欧拉又包含在GroupKey2 中;因此GroupKeys 1 和 2 必须具有相同的SupergroupKey
  • 因为高斯包含在GroupKeys 2 和 3 中,所以它们也必须具有相同的SupergroupKey。这导致GroupKeys 1-3 具有相同的SupergroupKey
  • 由于GroupKeys 1-3 不RecordKey与其余的GroupKeys共享任何s ,因此它们是唯一被赋值SupergroupKey为 1的值。

我应该补充一点,解决方案必须是通用的。上表和结果集只是一个例子。

附录

我删除了解决方案非迭代的要求。虽然我更喜欢这样的解决方案,但我认为这是一个不合理的限制。不幸的是,我无法使用任何基于 CLR 的解决方案;但是如果您想包含这样的解决方案,请随意。不过,我可能不会接受它作为答案。

我的真实表中的行数高达 500 万,但有时行数“仅”在一万左右。平均有 8 RecordKeys perGroupKey和 4 GroupKeys per RecordKey。我想一个解决方案将具有指数时间复杂度,但我仍然对一个解决方案感兴趣。

Mic*_*een 10

这个问题是关于跟踪项目之间的链接。这将其置于图形和图形处理领域。具体来说,整个数据集形成一个图,我们正在寻找该图的组件。这可以通过问题中的样本数据图来说明。

在此处输入图片说明

问题说我们可以按照 GroupKey 或 RecordKey 查找共享该值的其他行。所以我们可以将两者都视为图中的顶点。问题继续解释 GroupKeys 1-3 如何具有相同的 SupergroupKey。这可以看作是左侧的集群由细线连接。图中还展示了原始数据形成的另外两个组件(SupergroupKey)。

SQL Server 具有一些内置于 T-SQL 中的图形处理能力。但是,此时它非常微薄,并且对解决此问题没有帮助。SQL Server 还能够调用 R 和 Python,以及它们可用的丰富而强大的包套件。其中之一是igraph。它是为“快速处理具有数百万个顶点和边(链接)的大型图而编写的” 。

使用 R 和 igraph,我能够在本地测试1 中在 2 分 22 秒内处理一百万行。这是它与当前最佳解决方案的比较方式:

Record Keys     Paul White  R               
------------    ----------  --------
Per question    15ms        ~220ms
100             80ms        ~270ms
1,000           250ms       430ms
10,000          1.4s        1.7s
100,000         14s         14s
1M              2m29        2m22s
1M              n/a         1m40    process only, no display

The first column is the number of distinct RecordKey values. The number of rows
in the table will be 8 x this number.
Run Code Online (Sandbox Code Playgroud)

处理 1M 行时,使用 1m40s 来加载和处理图形,并更新表。用输出填充 SSMS 结果表需要 42 秒。

在处理 100 万行时观察任务管理器表明需要大约 3GB 的工作内存。这在此系统上可用,无需分页。

我可以确认 Ypercube 对递归 CTE 方法的评估。有几百个记录键,它消耗了 100% 的 CPU 和所有可用的 RAM。最终 tempdb 增长到超过 80GB,SPID 崩溃了。

我将 Paul 的表与 SupergroupKey 列一起使用,以便在解决方案之间进行公平的比较。

出于某种原因,R 反对庞加莱的口音。将其更改为普通的“e”允许它运行。我没有调查,因为它与手头的问题没有密切关系。我确定有解决办法。

这是代码

-- This captures the output from R so the base table can be updated.
drop table if exists #Results;

create table #Results
(
    Component   int         not NULL,
    Vertex      varchar(12) not NULL primary key
);


truncate table #Results;    -- facilitates re-execution

declare @Start time = sysdatetimeoffset();  -- for a 'total elapsed' calculation.

insert #Results(Component, Vertex)
exec sp_execute_external_script   
    @language = N'R',
    @input_data_1 = N'select GroupKey, RecordKey from dbo.Example',
    @script = N'
library(igraph)
df.g <- graph.data.frame(d = InputDataSet, directed = FALSE)
cpts <- components(df.g, mode = c("weak"))
OutputDataSet <- data.frame(cpts$membership)
OutputDataSet$VertexName <- V(df.g)$name
';

-- Write SuperGroupKey to the base table, as other solutions do
update e
set
    SupergroupKey = r.Component
from dbo.Example as e
inner join #Results as r
    on r.Vertex = e.RecordKey;

-- Return all rows, as other solutions do
select
    e.SupergroupKey,
    e.GroupKey,
    e.RecordKey
from dbo.Example as e;

-- Calculate the elapsed
declare @End time = sysdatetimeoffset();
select Elapse_ms = DATEDIFF(MILLISECOND, @Start, @End);

Run Code Online (Sandbox Code Playgroud)

这就是 R 代码的作用

  • @input_data_1 是 SQL Server 将数据从表传输到 R 代码并将其转换为名为 InputDataSet 的 R 数据帧的方式。

  • library(igraph) 将库导入到 R 执行环境中。

  • df.g <- graph.data.frame(d = InputDataSet, directed = FALSE)将数据加载到 igraph 对象中。这是一个无向图,因为我们可以跟踪从组到记录或记录到组的链接。InputDataSet 是 SQL Server 发送到 R 的数据集的默认名称。

  • cpts <- components(df.g, mode = c("weak")) 处理图以查找离散子图(组件)和其他度量。

  • OutputDataSet <- data.frame(cpts$membership)SQL Server 期望从 R 返回一个数据框。它的默认名称是 OutputDataSet。组件存储在称为“成员资格”的向量中。此语句将向量转换为数据框。

  • OutputDataSet$VertexName <- V(df.g)$nameV() 是图中顶点的向量 - GroupKeys 和 RecordKeys 的列表。这会将它们复制到输出数据框中,创建一个名为 VertexName 的新列。这是用于匹配源表以更新 SupergroupKey 的键。

我不是 R 专家。可能这可以优化。

测试数据

OP 的数据用于验证。对于规模测试,我使用了以下脚本。

drop table if exists Records;
drop table if exists Groups;

create table Groups(GroupKey int NOT NULL primary key);
create table Records(RecordKey varchar(12) NOT NULL primary key);
go

set nocount on;

-- Set @RecordCount to the number of distinct RecordKey values desired.
-- The number of rows in dbo.Example will be 8 * @RecordCount.
declare @RecordCount    int             = 1000000;

-- @Multiplier was determined by experiment.
-- It gives the OP's "8 RecordKeys per GroupKey and 4 GroupKeys per RecordKey"
-- and allows for clashes of the chosen random values.
declare @Multiplier     numeric(4, 2)   = 2.7;

-- The number of groups required to reproduce the OP's distribution.
declare @GroupCount     int             = FLOOR(@RecordCount * @Multiplier);


-- This is a poor man's numbers table.
insert Groups(GroupKey)
select top(@GroupCount)
    ROW_NUMBER() over (order by (select NULL))
from sys.objects as a
cross join sys.objects as b
--cross join sys.objects as c  -- include if needed


declare @c int = 0
while @c < @RecordCount
begin
    -- Can't use a set-based method since RAND() gives the same value for all rows.
    -- There are better ways to do this, but it works well enough.
    -- RecordKeys will be 10 letters, a-z.
    insert Records(RecordKey)
    select
        CHAR(97 + (26*RAND())) +
        CHAR(97 + (26*RAND())) +
        CHAR(97 + (26*RAND())) +
        CHAR(97 + (26*RAND())) +
        CHAR(97 + (26*RAND())) +
        CHAR(97 + (26*RAND())) +
        CHAR(97 + (26*RAND())) +
        CHAR(97 + (26*RAND())) +
        CHAR(97 + (26*RAND())) +
        CHAR(97 + (26*RAND()));

    set @c += 1;
end


-- Process each RecordKey in alphabetical order.
-- For each choose 8 GroupKeys to pair with it.
declare @RecordKey varchar(12) = '';
declare @Groups table (GroupKey int not null);

truncate table dbo.Example;

select top(1) @RecordKey = RecordKey 
from Records 
where RecordKey > @RecordKey 
order by RecordKey;

while @@ROWCOUNT > 0
begin
    print @Recordkey;

    delete @Groups;

    insert @Groups(GroupKey)
    select distinct C
    from
    (
        -- Hard-code * from OP's statistics
        select FLOOR(RAND() * @GroupCount)
        union all
        select FLOOR(RAND() * @GroupCount)
        union all
        select FLOOR(RAND() * @GroupCount)
        union all
        select FLOOR(RAND() * @GroupCount)
        union all
        select FLOOR(RAND() * @GroupCount)
        union all
        select FLOOR(RAND() * @GroupCount)
        union all
        select FLOOR(RAND() * @GroupCount)
        union all
        select FLOOR(RAND() * @GroupCount)
    ) as T(C);

    insert dbo.Example(GroupKey, RecordKey)
    select
        GroupKey, @RecordKey
    from @Groups;

    select top(1) @RecordKey = RecordKey 
    from Records 
    where RecordKey > @RecordKey 
    order by RecordKey;
end

-- Rebuild the indexes to have a consistent environment
alter index iExample on dbo.Example rebuild partition = all 
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, 
      ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON);


-- Check what we ended up with:
select COUNT(*) from dbo.Example;  -- Should be @RecordCount * 8
                                   -- Often a little less due to random clashes
select 
    ByGroup = AVG(C)
from
(
    select CONVERT(float, COUNT(1) over(partition by GroupKey)) 
    from dbo.Example
) as T(C);

select
    ByRecord = AVG(C)
from
(
    select CONVERT(float, COUNT(1) over(partition by RecordKey)) 
    from dbo.Example
) as T(C);

Run Code Online (Sandbox Code Playgroud)

我刚刚意识到我从 OP 的定义中得到了错误的比率。我不相信这会影响时间。记录和组与此过程对称。对于算法来说,它们都只是图中的节点。

在测试中,数据总是形成一个单一的组成部分。我相信这是由于数据的均匀分布。如果不是将静态的 1:8 比率硬编码到生成例程中,我允许比率发生变化,那么更有可能有更多的组件。



1机器规格:Microsoft SQL Server 2017 (RTM-CU12)、Developer Edition(64 位)、Windows 10 Home。16GB RAM,SSD,4 核超线程 i7,标称 2.8GHz。测试是当时唯一运行的项目,除了正常的系统活动(大约 4% 的 CPU)。


Pau*_*ite 7

这是一个用于性能比较的迭代 T-SQL 解决方案。

它假设可以向表中添加一个额外的列来存储超级组键,并且可以更改索引:

设置

DROP TABLE IF EXISTS 
    dbo.Example;

CREATE TABLE dbo.Example
(
    SupergroupKey integer NOT NULL
        DEFAULT 0, 
    GroupKey integer NOT NULL, 
    RecordKey varchar(12) NOT NULL,

    CONSTRAINT iExample 
    PRIMARY KEY CLUSTERED 
        (GroupKey ASC, RecordKey ASC),

    CONSTRAINT [IX dbo.Example RecordKey, GroupKey]
    UNIQUE NONCLUSTERED (RecordKey, GroupKey),

    INDEX [IX dbo.Example SupergroupKey, GroupKey]
        (SupergroupKey ASC, GroupKey ASC)
);

INSERT dbo.Example
    (GroupKey, RecordKey)
VALUES 
    (1, 'Archimedes'), 
    (1, 'Newton'),
    (1, 'Euler'),
    (2, 'Euler'),
    (2, 'Gauss'),
    (3, 'Gauss'),
    (3, 'Poincaré'),
    (4, 'Ramanujan'),
    (5, 'Neumann'),
    (5, 'Grothendieck'),
    (6, 'Grothendieck'),
    (6, 'Tao');
Run Code Online (Sandbox Code Playgroud)

如果您能够反转当前主键的键顺序,则不需要额外的唯一索引。

大纲

该解决方案的方法是:

  1. 将超级组 ID 设置为 1
  2. 找到编号最小的未处理组密钥
  3. 如果没有找到,退出
  4. 使用当前组键为所有行设置超级组
  5. 为当前组中相关的所有行设置超级组
  6. 重复步骤 5 直到没有行被更新
  7. 增加当前的超级组id
  8. 转到第 2 步

执行

内联评论:

-- No execution plans or rows affected messages
SET NOCOUNT ON;
SET STATISTICS XML OFF;

-- Reset all supergroups
UPDATE E
SET SupergroupKey = 0
FROM dbo.Example AS E
    WITH (TABLOCKX)
WHERE 
    SupergroupKey != 0;

DECLARE 
    @CurrentSupergroup integer = 0,
    @CurrentGroup integer = 0;

WHILE 1 = 1
BEGIN
    -- Next super group
    SET @CurrentSupergroup += 1;

    -- Find the lowest unprocessed group key
    SELECT 
        @CurrentGroup = MIN(E.GroupKey)
    FROM dbo.Example AS E
    WHERE 
        E.SupergroupKey = 0;

    -- Exit when no more unprocessed groups
    IF @CurrentGroup IS NULL BREAK;

    -- Set super group for all records in the current group
    UPDATE E
    SET E.SupergroupKey = @CurrentSupergroup
    FROM dbo.Example AS E 
    WHERE 
        E.GroupKey = @CurrentGroup;

    -- Iteratively find all groups for the super group
    WHILE 1 = 1
    BEGIN
        WITH 
            RecordKeys AS
            (
                SELECT DISTINCT
                    E.RecordKey
                FROM dbo.Example AS E
                WHERE
                    E.SupergroupKey = @CurrentSupergroup
            ),
            GroupKeys AS
            (
                SELECT DISTINCT
                    E.GroupKey
                FROM RecordKeys AS RK
                JOIN dbo.Example AS E
                    WITH (FORCESEEK)
                    ON E.RecordKey = RK.RecordKey
            )
        UPDATE E WITH (TABLOCKX)
        SET SupergroupKey = @CurrentSupergroup
        FROM GroupKeys AS GK
        JOIN dbo.Example AS E
            ON E.GroupKey = GK.GroupKey
        WHERE
            E.SupergroupKey = 0
        OPTION (RECOMPILE, QUERYTRACEON 9481); -- The original CE does better

        -- Break when no more related groups found
        IF @@ROWCOUNT = 0 BREAK;
    END;
END;

SELECT
    E.SupergroupKey,
    E.GroupKey,
    E.RecordKey
FROM dbo.Example AS E;
Run Code Online (Sandbox Code Playgroud)

执行计划

对于密钥更新:

更新计划

结果

表的最终状态是:

DROP TABLE IF EXISTS 
    dbo.Example;

CREATE TABLE dbo.Example
(
    SupergroupKey integer NOT NULL
        DEFAULT 0, 
    GroupKey integer NOT NULL, 
    RecordKey varchar(12) NOT NULL,

    CONSTRAINT iExample 
    PRIMARY KEY CLUSTERED 
        (GroupKey ASC, RecordKey ASC),

    CONSTRAINT [IX dbo.Example RecordKey, GroupKey]
    UNIQUE NONCLUSTERED (RecordKey, GroupKey),

    INDEX [IX dbo.Example SupergroupKey, GroupKey]
        (SupergroupKey ASC, GroupKey ASC)
);

INSERT dbo.Example
    (GroupKey, RecordKey)
VALUES 
    (1, 'Archimedes'), 
    (1, 'Newton'),
    (1, 'Euler'),
    (2, 'Euler'),
    (2, 'Gauss'),
    (3, 'Gauss'),
    (3, 'Poincaré'),
    (4, 'Ramanujan'),
    (5, 'Neumann'),
    (5, 'Grothendieck'),
    (6, 'Grothendieck'),
    (6, 'Tao');
Run Code Online (Sandbox Code Playgroud)

演示:db<>小提琴

性能测试

使用 Michael Green's answer 中提供的扩展测试数据集,我的笔记本电脑* 上的时间是:

-- No execution plans or rows affected messages
SET NOCOUNT ON;
SET STATISTICS XML OFF;

-- Reset all supergroups
UPDATE E
SET SupergroupKey = 0
FROM dbo.Example AS E
    WITH (TABLOCKX)
WHERE 
    SupergroupKey != 0;

DECLARE 
    @CurrentSupergroup integer = 0,
    @CurrentGroup integer = 0;

WHILE 1 = 1
BEGIN
    -- Next super group
    SET @CurrentSupergroup += 1;

    -- Find the lowest unprocessed group key
    SELECT 
        @CurrentGroup = MIN(E.GroupKey)
    FROM dbo.Example AS E
    WHERE 
        E.SupergroupKey = 0;

    -- Exit when no more unprocessed groups
    IF @CurrentGroup IS NULL BREAK;

    -- Set super group for all records in the current group
    UPDATE E
    SET E.SupergroupKey = @CurrentSupergroup
    FROM dbo.Example AS E 
    WHERE 
        E.GroupKey = @CurrentGroup;

    -- Iteratively find all groups for the super group
    WHILE 1 = 1
    BEGIN
        WITH 
            RecordKeys AS
            (
                SELECT DISTINCT
                    E.RecordKey
                FROM dbo.Example AS E
                WHERE
                    E.SupergroupKey = @CurrentSupergroup
            ),
            GroupKeys AS
            (
                SELECT DISTINCT
                    E.GroupKey
                FROM RecordKeys AS RK
                JOIN dbo.Example AS E
                    WITH (FORCESEEK)
                    ON E.RecordKey = RK.RecordKey
            )
        UPDATE E WITH (TABLOCKX)
        SET SupergroupKey = @CurrentSupergroup
        FROM GroupKeys AS GK
        JOIN dbo.Example AS E
            ON E.GroupKey = GK.GroupKey
        WHERE
            E.SupergroupKey = 0
        OPTION (RECOMPILE, QUERYTRACEON 9481); -- The original CE does better

        -- Break when no more related groups found
        IF @@ROWCOUNT = 0 BREAK;
    END;
END;

SELECT
    E.SupergroupKey,
    E.GroupKey,
    E.RecordKey
FROM dbo.Example AS E;
Run Code Online (Sandbox Code Playgroud)

* Microsoft SQL Server 2017 (RTM-CU13),开发者版(64 位),Windows 10 专业版,16GB RAM,SSD,4 核超线程 i7,标称 2.4GHz。


ype*_*eᵀᴹ 6

递归 CTE 方法 - 在大表中可能效率极低:

WITH rCTE AS 
(
    -- Anchor
    SELECT 
        GroupKey, RecordKey, 
        CAST('|' + CAST(GroupKey AS VARCHAR(10)) + '|' AS VARCHAR(100)) AS GroupKeys,
        CAST('|' + CAST(RecordKey AS VARCHAR(10)) + '|' AS VARCHAR(100)) AS RecordKeys,
        1 AS lvl
    FROM Example

    UNION ALL

    -- Recursive
    SELECT
        e.GroupKey, e.RecordKey, 
        CASE WHEN r.GroupKeys NOT LIKE '%|' + CAST(e.GroupKey AS VARCHAR(10)) + '|%'
            THEN CAST(r.GroupKeys + CAST(e.GroupKey AS VARCHAR(10)) + '|' AS VARCHAR(100))
            ELSE r.GroupKeys
        END,
        CASE WHEN r.RecordKeys NOT LIKE '%|' + CAST(e.RecordKey AS VARCHAR(10)) + '|%'
            THEN CAST(r.RecordKeys + CAST(e.RecordKey AS VARCHAR(10)) + '|' AS VARCHAR(100))
            ELSE r.RecordKeys
        END,
        r.lvl + 1
    FROM rCTE AS r
         JOIN Example AS e
         ON  e.RecordKey = r.RecordKey
         AND r.GroupKeys NOT LIKE '%|' + CAST(e.GroupKey AS VARCHAR(10)) + '|%'
         -- 
         OR e.GroupKey = r.GroupKey
         AND r.RecordKeys NOT LIKE '%|' + CAST(e.RecordKey AS VARCHAR(10)) + '|%'
)
SELECT 
    ROW_NUMBER() OVER (ORDER BY GroupKeys) AS SuperGroupKey,
    GroupKeys, RecordKeys
FROM rCTE AS c
WHERE NOT EXISTS
      ( SELECT 1
        FROM rCTE AS m
        WHERE m.lvl > c.lvl
          AND m.GroupKeys LIKE '%|' + CAST(c.GroupKey AS VARCHAR(10)) + '|%'
        OR    m.lvl = c.lvl
          AND ( m.GroupKey > c.GroupKey
             OR m.GroupKey = c.GroupKey
             AND m.RecordKeys > c.RecordKeys
              )
          AND m.GroupKeys LIKE '%|' + CAST(c.GroupKey AS VARCHAR(10)) + '|%'
          AND c.GroupKeys LIKE '%|' + CAST(m.GroupKey AS VARCHAR(10)) + '|%'
      ) 
OPTION (MAXRECURSION 0) ;
Run Code Online (Sandbox Code Playgroud)

dbfiddle.uk 中测试