bas*_*n22 9 sql-server sql-server-2016
我有一个可以使用以下代码创建和填充的表:
CREATE TABLE dbo.Example(GroupKey int NOT NULL, RecordKey varchar(12) NOT NULL);
ALTER TABLE dbo.Example
ADD CONSTRAINT iExample PRIMARY KEY CLUSTERED(GroupKey ASC, RecordKey ASC);
INSERT INTO dbo.Example(GroupKey, RecordKey)
VALUES (1, 'Archimedes'), (1, 'Newton'), (1, 'Euler'), (2, 'Euler'), (2, 'Gauss'),
(3, 'Gauss'), (3, 'Poincaré'), (4, 'Ramanujan'), (5, 'Neumann'),
(5, 'Grothendieck'), (6, 'Grothendieck'), (6, 'Tao');
Run Code Online (Sandbox Code Playgroud)
对于基于与另一行的有限协作距离的所有行RecordKey,我想分配一个唯一值 - 我不关心唯一值是如何或什么数据类型。
可以使用以下查询生成符合我要求的正确结果集:
SELECT 1 AS SupergroupKey, GroupKey, RecordKey
FROM dbo.Example
WHERE GroupKey IN(1, 2, 3)
UNION ALL
SELECT 2 AS SupergroupKey, GroupKey, RecordKey
FROM dbo.Example
WHERE GroupKey = 4
UNION ALL
SELECT 3 AS SupergroupKey, GroupKey, RecordKey
FROM dbo.Example
WHERE GroupKey IN(5, 6)
ORDER BY SupergroupKey ASC, GroupKey ASC, RecordKey ASC;
Run Code Online (Sandbox Code Playgroud)
为了更好地帮助我提出问题,我将解释为什么GroupKeys 1-3 具有相同的SupergroupKey:
GroupKey1 包含RecordKey欧拉,而欧拉又包含在GroupKey2 中;因此GroupKeys 1 和 2 必须具有相同的SupergroupKey。GroupKeys 2 和 3 中,所以它们也必须具有相同的SupergroupKey。这导致GroupKeys 1-3 具有相同的SupergroupKey。GroupKeys 1-3 不RecordKey与其余的GroupKeys共享任何s ,因此它们是唯一被赋值SupergroupKey为 1的值。我应该补充一点,解决方案必须是通用的。上表和结果集只是一个例子。
附录
我删除了解决方案非迭代的要求。虽然我更喜欢这样的解决方案,但我认为这是一个不合理的限制。不幸的是,我无法使用任何基于 CLR 的解决方案;但是如果您想包含这样的解决方案,请随意。不过,我可能不会接受它作为答案。
我的真实表中的行数高达 500 万,但有时行数“仅”在一万左右。平均有 8 RecordKeys perGroupKey和 4 GroupKeys per RecordKey。我想一个解决方案将具有指数时间复杂度,但我仍然对一个解决方案感兴趣。
Mic*_*een 10
这个问题是关于跟踪项目之间的链接。这将其置于图形和图形处理领域。具体来说,整个数据集形成一个图,我们正在寻找该图的组件。这可以通过问题中的样本数据图来说明。
问题说我们可以按照 GroupKey 或 RecordKey 查找共享该值的其他行。所以我们可以将两者都视为图中的顶点。问题继续解释 GroupKeys 1-3 如何具有相同的 SupergroupKey。这可以看作是左侧的集群由细线连接。图中还展示了原始数据形成的另外两个组件(SupergroupKey)。
SQL Server 具有一些内置于 T-SQL 中的图形处理能力。但是,此时它非常微薄,并且对解决此问题没有帮助。SQL Server 还能够调用 R 和 Python,以及它们可用的丰富而强大的包套件。其中之一是igraph。它是为“快速处理具有数百万个顶点和边(链接)的大型图而编写的” 。
使用 R 和 igraph,我能够在本地测试1 中在 2 分 22 秒内处理一百万行。这是它与当前最佳解决方案的比较方式:
Record Keys Paul White R
------------ ---------- --------
Per question 15ms ~220ms
100 80ms ~270ms
1,000 250ms 430ms
10,000 1.4s 1.7s
100,000 14s 14s
1M 2m29 2m22s
1M n/a 1m40 process only, no display
The first column is the number of distinct RecordKey values. The number of rows
in the table will be 8 x this number.
Run Code Online (Sandbox Code Playgroud)
处理 1M 行时,使用 1m40s 来加载和处理图形,并更新表。用输出填充 SSMS 结果表需要 42 秒。
在处理 100 万行时观察任务管理器表明需要大约 3GB 的工作内存。这在此系统上可用,无需分页。
我可以确认 Ypercube 对递归 CTE 方法的评估。有几百个记录键,它消耗了 100% 的 CPU 和所有可用的 RAM。最终 tempdb 增长到超过 80GB,SPID 崩溃了。
我将 Paul 的表与 SupergroupKey 列一起使用,以便在解决方案之间进行公平的比较。
出于某种原因,R 反对庞加莱的口音。将其更改为普通的“e”允许它运行。我没有调查,因为它与手头的问题没有密切关系。我确定有解决办法。
这是代码
-- This captures the output from R so the base table can be updated.
drop table if exists #Results;
create table #Results
(
Component int not NULL,
Vertex varchar(12) not NULL primary key
);
truncate table #Results; -- facilitates re-execution
declare @Start time = sysdatetimeoffset(); -- for a 'total elapsed' calculation.
insert #Results(Component, Vertex)
exec sp_execute_external_script
@language = N'R',
@input_data_1 = N'select GroupKey, RecordKey from dbo.Example',
@script = N'
library(igraph)
df.g <- graph.data.frame(d = InputDataSet, directed = FALSE)
cpts <- components(df.g, mode = c("weak"))
OutputDataSet <- data.frame(cpts$membership)
OutputDataSet$VertexName <- V(df.g)$name
';
-- Write SuperGroupKey to the base table, as other solutions do
update e
set
SupergroupKey = r.Component
from dbo.Example as e
inner join #Results as r
on r.Vertex = e.RecordKey;
-- Return all rows, as other solutions do
select
e.SupergroupKey,
e.GroupKey,
e.RecordKey
from dbo.Example as e;
-- Calculate the elapsed
declare @End time = sysdatetimeoffset();
select Elapse_ms = DATEDIFF(MILLISECOND, @Start, @End);
Run Code Online (Sandbox Code Playgroud)
这就是 R 代码的作用
@input_data_1 是 SQL Server 将数据从表传输到 R 代码并将其转换为名为 InputDataSet 的 R 数据帧的方式。
library(igraph) 将库导入到 R 执行环境中。
df.g <- graph.data.frame(d = InputDataSet, directed = FALSE)将数据加载到 igraph 对象中。这是一个无向图,因为我们可以跟踪从组到记录或记录到组的链接。InputDataSet 是 SQL Server 发送到 R 的数据集的默认名称。
cpts <- components(df.g, mode = c("weak")) 处理图以查找离散子图(组件)和其他度量。
OutputDataSet <- data.frame(cpts$membership)SQL Server 期望从 R 返回一个数据框。它的默认名称是 OutputDataSet。组件存储在称为“成员资格”的向量中。此语句将向量转换为数据框。
OutputDataSet$VertexName <- V(df.g)$nameV() 是图中顶点的向量 - GroupKeys 和 RecordKeys 的列表。这会将它们复制到输出数据框中,创建一个名为 VertexName 的新列。这是用于匹配源表以更新 SupergroupKey 的键。
我不是 R 专家。可能这可以优化。
OP 的数据用于验证。对于规模测试,我使用了以下脚本。
drop table if exists Records;
drop table if exists Groups;
create table Groups(GroupKey int NOT NULL primary key);
create table Records(RecordKey varchar(12) NOT NULL primary key);
go
set nocount on;
-- Set @RecordCount to the number of distinct RecordKey values desired.
-- The number of rows in dbo.Example will be 8 * @RecordCount.
declare @RecordCount int = 1000000;
-- @Multiplier was determined by experiment.
-- It gives the OP's "8 RecordKeys per GroupKey and 4 GroupKeys per RecordKey"
-- and allows for clashes of the chosen random values.
declare @Multiplier numeric(4, 2) = 2.7;
-- The number of groups required to reproduce the OP's distribution.
declare @GroupCount int = FLOOR(@RecordCount * @Multiplier);
-- This is a poor man's numbers table.
insert Groups(GroupKey)
select top(@GroupCount)
ROW_NUMBER() over (order by (select NULL))
from sys.objects as a
cross join sys.objects as b
--cross join sys.objects as c -- include if needed
declare @c int = 0
while @c < @RecordCount
begin
-- Can't use a set-based method since RAND() gives the same value for all rows.
-- There are better ways to do this, but it works well enough.
-- RecordKeys will be 10 letters, a-z.
insert Records(RecordKey)
select
CHAR(97 + (26*RAND())) +
CHAR(97 + (26*RAND())) +
CHAR(97 + (26*RAND())) +
CHAR(97 + (26*RAND())) +
CHAR(97 + (26*RAND())) +
CHAR(97 + (26*RAND())) +
CHAR(97 + (26*RAND())) +
CHAR(97 + (26*RAND())) +
CHAR(97 + (26*RAND())) +
CHAR(97 + (26*RAND()));
set @c += 1;
end
-- Process each RecordKey in alphabetical order.
-- For each choose 8 GroupKeys to pair with it.
declare @RecordKey varchar(12) = '';
declare @Groups table (GroupKey int not null);
truncate table dbo.Example;
select top(1) @RecordKey = RecordKey
from Records
where RecordKey > @RecordKey
order by RecordKey;
while @@ROWCOUNT > 0
begin
print @Recordkey;
delete @Groups;
insert @Groups(GroupKey)
select distinct C
from
(
-- Hard-code * from OP's statistics
select FLOOR(RAND() * @GroupCount)
union all
select FLOOR(RAND() * @GroupCount)
union all
select FLOOR(RAND() * @GroupCount)
union all
select FLOOR(RAND() * @GroupCount)
union all
select FLOOR(RAND() * @GroupCount)
union all
select FLOOR(RAND() * @GroupCount)
union all
select FLOOR(RAND() * @GroupCount)
union all
select FLOOR(RAND() * @GroupCount)
) as T(C);
insert dbo.Example(GroupKey, RecordKey)
select
GroupKey, @RecordKey
from @Groups;
select top(1) @RecordKey = RecordKey
from Records
where RecordKey > @RecordKey
order by RecordKey;
end
-- Rebuild the indexes to have a consistent environment
alter index iExample on dbo.Example rebuild partition = all
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF,
ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON);
-- Check what we ended up with:
select COUNT(*) from dbo.Example; -- Should be @RecordCount * 8
-- Often a little less due to random clashes
select
ByGroup = AVG(C)
from
(
select CONVERT(float, COUNT(1) over(partition by GroupKey))
from dbo.Example
) as T(C);
select
ByRecord = AVG(C)
from
(
select CONVERT(float, COUNT(1) over(partition by RecordKey))
from dbo.Example
) as T(C);
Run Code Online (Sandbox Code Playgroud)
我刚刚意识到我从 OP 的定义中得到了错误的比率。我不相信这会影响时间。记录和组与此过程对称。对于算法来说,它们都只是图中的节点。
在测试中,数据总是形成一个单一的组成部分。我相信这是由于数据的均匀分布。如果不是将静态的 1:8 比率硬编码到生成例程中,我允许比率发生变化,那么更有可能有更多的组件。
1机器规格:Microsoft SQL Server 2017 (RTM-CU12)、Developer Edition(64 位)、Windows 10 Home。16GB RAM,SSD,4 核超线程 i7,标称 2.8GHz。测试是当时唯一运行的项目,除了正常的系统活动(大约 4% 的 CPU)。
这是一个用于性能比较的迭代 T-SQL 解决方案。
它假设可以向表中添加一个额外的列来存储超级组键,并且可以更改索引:
DROP TABLE IF EXISTS
dbo.Example;
CREATE TABLE dbo.Example
(
SupergroupKey integer NOT NULL
DEFAULT 0,
GroupKey integer NOT NULL,
RecordKey varchar(12) NOT NULL,
CONSTRAINT iExample
PRIMARY KEY CLUSTERED
(GroupKey ASC, RecordKey ASC),
CONSTRAINT [IX dbo.Example RecordKey, GroupKey]
UNIQUE NONCLUSTERED (RecordKey, GroupKey),
INDEX [IX dbo.Example SupergroupKey, GroupKey]
(SupergroupKey ASC, GroupKey ASC)
);
INSERT dbo.Example
(GroupKey, RecordKey)
VALUES
(1, 'Archimedes'),
(1, 'Newton'),
(1, 'Euler'),
(2, 'Euler'),
(2, 'Gauss'),
(3, 'Gauss'),
(3, 'Poincaré'),
(4, 'Ramanujan'),
(5, 'Neumann'),
(5, 'Grothendieck'),
(6, 'Grothendieck'),
(6, 'Tao');
Run Code Online (Sandbox Code Playgroud)
如果您能够反转当前主键的键顺序,则不需要额外的唯一索引。
该解决方案的方法是:
内联评论:
-- No execution plans or rows affected messages
SET NOCOUNT ON;
SET STATISTICS XML OFF;
-- Reset all supergroups
UPDATE E
SET SupergroupKey = 0
FROM dbo.Example AS E
WITH (TABLOCKX)
WHERE
SupergroupKey != 0;
DECLARE
@CurrentSupergroup integer = 0,
@CurrentGroup integer = 0;
WHILE 1 = 1
BEGIN
-- Next super group
SET @CurrentSupergroup += 1;
-- Find the lowest unprocessed group key
SELECT
@CurrentGroup = MIN(E.GroupKey)
FROM dbo.Example AS E
WHERE
E.SupergroupKey = 0;
-- Exit when no more unprocessed groups
IF @CurrentGroup IS NULL BREAK;
-- Set super group for all records in the current group
UPDATE E
SET E.SupergroupKey = @CurrentSupergroup
FROM dbo.Example AS E
WHERE
E.GroupKey = @CurrentGroup;
-- Iteratively find all groups for the super group
WHILE 1 = 1
BEGIN
WITH
RecordKeys AS
(
SELECT DISTINCT
E.RecordKey
FROM dbo.Example AS E
WHERE
E.SupergroupKey = @CurrentSupergroup
),
GroupKeys AS
(
SELECT DISTINCT
E.GroupKey
FROM RecordKeys AS RK
JOIN dbo.Example AS E
WITH (FORCESEEK)
ON E.RecordKey = RK.RecordKey
)
UPDATE E WITH (TABLOCKX)
SET SupergroupKey = @CurrentSupergroup
FROM GroupKeys AS GK
JOIN dbo.Example AS E
ON E.GroupKey = GK.GroupKey
WHERE
E.SupergroupKey = 0
OPTION (RECOMPILE, QUERYTRACEON 9481); -- The original CE does better
-- Break when no more related groups found
IF @@ROWCOUNT = 0 BREAK;
END;
END;
SELECT
E.SupergroupKey,
E.GroupKey,
E.RecordKey
FROM dbo.Example AS E;
Run Code Online (Sandbox Code Playgroud)
对于密钥更新:
表的最终状态是:
DROP TABLE IF EXISTS
dbo.Example;
CREATE TABLE dbo.Example
(
SupergroupKey integer NOT NULL
DEFAULT 0,
GroupKey integer NOT NULL,
RecordKey varchar(12) NOT NULL,
CONSTRAINT iExample
PRIMARY KEY CLUSTERED
(GroupKey ASC, RecordKey ASC),
CONSTRAINT [IX dbo.Example RecordKey, GroupKey]
UNIQUE NONCLUSTERED (RecordKey, GroupKey),
INDEX [IX dbo.Example SupergroupKey, GroupKey]
(SupergroupKey ASC, GroupKey ASC)
);
INSERT dbo.Example
(GroupKey, RecordKey)
VALUES
(1, 'Archimedes'),
(1, 'Newton'),
(1, 'Euler'),
(2, 'Euler'),
(2, 'Gauss'),
(3, 'Gauss'),
(3, 'Poincaré'),
(4, 'Ramanujan'),
(5, 'Neumann'),
(5, 'Grothendieck'),
(6, 'Grothendieck'),
(6, 'Tao');
Run Code Online (Sandbox Code Playgroud)
演示:db<>小提琴
使用 Michael Green's answer 中提供的扩展测试数据集,我的笔记本电脑* 上的时间是:
-- No execution plans or rows affected messages
SET NOCOUNT ON;
SET STATISTICS XML OFF;
-- Reset all supergroups
UPDATE E
SET SupergroupKey = 0
FROM dbo.Example AS E
WITH (TABLOCKX)
WHERE
SupergroupKey != 0;
DECLARE
@CurrentSupergroup integer = 0,
@CurrentGroup integer = 0;
WHILE 1 = 1
BEGIN
-- Next super group
SET @CurrentSupergroup += 1;
-- Find the lowest unprocessed group key
SELECT
@CurrentGroup = MIN(E.GroupKey)
FROM dbo.Example AS E
WHERE
E.SupergroupKey = 0;
-- Exit when no more unprocessed groups
IF @CurrentGroup IS NULL BREAK;
-- Set super group for all records in the current group
UPDATE E
SET E.SupergroupKey = @CurrentSupergroup
FROM dbo.Example AS E
WHERE
E.GroupKey = @CurrentGroup;
-- Iteratively find all groups for the super group
WHILE 1 = 1
BEGIN
WITH
RecordKeys AS
(
SELECT DISTINCT
E.RecordKey
FROM dbo.Example AS E
WHERE
E.SupergroupKey = @CurrentSupergroup
),
GroupKeys AS
(
SELECT DISTINCT
E.GroupKey
FROM RecordKeys AS RK
JOIN dbo.Example AS E
WITH (FORCESEEK)
ON E.RecordKey = RK.RecordKey
)
UPDATE E WITH (TABLOCKX)
SET SupergroupKey = @CurrentSupergroup
FROM GroupKeys AS GK
JOIN dbo.Example AS E
ON E.GroupKey = GK.GroupKey
WHERE
E.SupergroupKey = 0
OPTION (RECOMPILE, QUERYTRACEON 9481); -- The original CE does better
-- Break when no more related groups found
IF @@ROWCOUNT = 0 BREAK;
END;
END;
SELECT
E.SupergroupKey,
E.GroupKey,
E.RecordKey
FROM dbo.Example AS E;
Run Code Online (Sandbox Code Playgroud)
* Microsoft SQL Server 2017 (RTM-CU13),开发者版(64 位),Windows 10 专业版,16GB RAM,SSD,4 核超线程 i7,标称 2.4GHz。
递归 CTE 方法 - 在大表中可能效率极低:
WITH rCTE AS
(
-- Anchor
SELECT
GroupKey, RecordKey,
CAST('|' + CAST(GroupKey AS VARCHAR(10)) + '|' AS VARCHAR(100)) AS GroupKeys,
CAST('|' + CAST(RecordKey AS VARCHAR(10)) + '|' AS VARCHAR(100)) AS RecordKeys,
1 AS lvl
FROM Example
UNION ALL
-- Recursive
SELECT
e.GroupKey, e.RecordKey,
CASE WHEN r.GroupKeys NOT LIKE '%|' + CAST(e.GroupKey AS VARCHAR(10)) + '|%'
THEN CAST(r.GroupKeys + CAST(e.GroupKey AS VARCHAR(10)) + '|' AS VARCHAR(100))
ELSE r.GroupKeys
END,
CASE WHEN r.RecordKeys NOT LIKE '%|' + CAST(e.RecordKey AS VARCHAR(10)) + '|%'
THEN CAST(r.RecordKeys + CAST(e.RecordKey AS VARCHAR(10)) + '|' AS VARCHAR(100))
ELSE r.RecordKeys
END,
r.lvl + 1
FROM rCTE AS r
JOIN Example AS e
ON e.RecordKey = r.RecordKey
AND r.GroupKeys NOT LIKE '%|' + CAST(e.GroupKey AS VARCHAR(10)) + '|%'
--
OR e.GroupKey = r.GroupKey
AND r.RecordKeys NOT LIKE '%|' + CAST(e.RecordKey AS VARCHAR(10)) + '|%'
)
SELECT
ROW_NUMBER() OVER (ORDER BY GroupKeys) AS SuperGroupKey,
GroupKeys, RecordKeys
FROM rCTE AS c
WHERE NOT EXISTS
( SELECT 1
FROM rCTE AS m
WHERE m.lvl > c.lvl
AND m.GroupKeys LIKE '%|' + CAST(c.GroupKey AS VARCHAR(10)) + '|%'
OR m.lvl = c.lvl
AND ( m.GroupKey > c.GroupKey
OR m.GroupKey = c.GroupKey
AND m.RecordKeys > c.RecordKeys
)
AND m.GroupKeys LIKE '%|' + CAST(c.GroupKey AS VARCHAR(10)) + '|%'
AND c.GroupKeys LIKE '%|' + CAST(m.GroupKey AS VARCHAR(10)) + '|%'
)
OPTION (MAXRECURSION 0) ;
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
412 次 |
| 最近记录: |