What is the fastest way to delete duplicate rows?

Mar*_*lli 3 sql-server delete duplication sql-server-2014

I need to delete duplicate rows from a large table. What is the best way to do that?

Currently I use this algorithm:

declare @t table ([key] int  )

insert into @t select 1
insert into @t select 1
insert into @t select 1
insert into @t select 2
insert into @t select 2
insert into @t select 3
insert into @t select 4
insert into @t select 4
insert into @t select 4
insert into @t select 4
insert into @t select 4
insert into @t select 5
insert into @t select 5
insert into @t select 5
insert into @t select 5
insert into @t select 5
insert into @t select 6
insert into @t select 6
insert into @t select 6
insert into @t select 7
insert into @t select 7
insert into @t select 8
insert into @t select 8
insert into @t select 9
insert into @t select 9
insert into @t select 9
insert into @t select 9
insert into @t select 9


select * from @t

; with cte as (
    select *
        , row_number() over (partition by [Key] order by [Key]) as Picker
    from @t
    )
delete cte 
where Picker > 1

select * from @t

This is what I run on my system:

;WITH Customer AS
    (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY AccountCode ORDER BY AccountCode ) AS [Version]
    FROM Stage.Customer
    )
    DELETE
    FROM    Customer
    WHERE   [Version] <> 1

In my tests, <> 1 performed better than > 1.
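One way to check a claim like this on your own data is to run both predicates inside transactions that are rolled back, with I/O and timing statistics switched on. The following is only a sketch, assuming the Stage.Customer table and AccountCode column from the query above. The two predicates are logically equivalent here because ROW_NUMBER() never returns a value below 1, so any difference would come from the plan or from caching, not from the result. Note that the rolled-back deletes still take locks and write log records, so this is better run against a test copy than the live system.

```sql
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Variant A: [Version] <> 1
BEGIN TRANSACTION;

    WITH Customer AS
    (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY AccountCode ORDER BY AccountCode) AS [Version]
        FROM Stage.Customer
    )
    DELETE FROM Customer
    WHERE [Version] <> 1;

ROLLBACK TRANSACTION;  -- undo the delete so variant B sees the same rows

-- Variant B: [Version] > 1
BEGIN TRANSACTION;

    WITH Customer AS
    (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY AccountCode ORDER BY AccountCode) AS [Version]
        FROM Stage.Customer
    )
    DELETE FROM Customer
    WHERE [Version] > 1;

ROLLBACK TRANSACTION;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

The first variant to run may pay for reading pages from disk, so it is worth repeating the comparison in both orders before drawing a conclusion.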

I could create this index, which does not currently exist:

USE [BodenDWH]
GO
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [Stage].[Customer] ([AccountCode])
INCLUDE ([ID])
GO

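Is there any other way to get this done?

In this case the table is not big: about 500,000 records on the live system.

The delete is part of an SSIS package that runs daily and removes about 10-15 records each day.

There is a problem with the way the data is structured. I need only one AccountCode per customer, but duplicates can occur, and if they are not deleted they break the package at a later stage.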
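To see which AccountCode values are affected before deleting anything, a grouped count can help. This is a minimal sketch, assuming the Stage.Customer table and AccountCode column shown earlier; the RowsPerAccountCode alias is made up for illustration.

```sql
-- List every AccountCode that appears more than once, worst offenders first.
SELECT AccountCode, COUNT(*) AS RowsPerAccountCode
FROM Stage.Customer
GROUP BY AccountCode
HAVING COUNT(*) > 1
ORDER BY RowsPerAccountCode DESC;
```

With roughly 10-15 duplicates a day, this should return only a handful of rows, which makes it a cheap sanity check to run before and after the delete step.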

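I did not develop this package, and redesigning anything is outside my scope.

I am just looking for the fastest way to get rid of the duplicates, without getting into index creation or anything else, just T-SQL code.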

Kin*_*hah 5

If the table is small and the number of rows you are deleting is small, then use

;WITH Customer AS
    (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY AccountCode ORDER BY (select null) ) AS [Version]
    FROM dbo.Customer
    )
    DELETE
    FROM    Customer
    WHERE   [Version] > 1;

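Note: the query above uses an arbitrary ordering in the window order clause, ORDER BY (select null) (learned from Itzik Ben-Gan's T-SQL Querying book; @AaronBertrand has also referred to it above). With an arbitrary ordering, which row of each duplicate group is kept is not defined.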
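If it matters which row of each duplicate group survives, the window can be ordered by a column that distinguishes the rows. The sketch below assumes the table has a unique ID column (the question's index suggestion mentions one, but its uniqueness is an assumption) and keeps the row with the lowest ID for each AccountCode.

```sql
-- Assumption: ID is unique, so the row numbering within each AccountCode is deterministic.
WITH Customer AS
(
    SELECT *, ROW_NUMBER() OVER (PARTITION BY AccountCode ORDER BY ID) AS [Version]
    FROM dbo.Customer
)
DELETE FROM Customer
WHERE [Version] > 1;   -- row number 1 (lowest ID) is kept, the rest are deleted
```

Ordering by ID DESC instead would keep the most recently inserted row, if ID happens to be an increasing identity value. The trade-off is that a real ordering column may add a sort that ORDER BY (select null) avoids.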

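If the table is large (e.g. 5M records), deleting in small batches or chunks helps avoid bloating the transaction log and prevents lock escalation.

Lock escalation is triggered when a single Transact-SQL statement acquires at least 5,000 locks on a single reference of a table. (It can also be triggered when the instance runs past its lock memory thresholds.)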

while 1=1
begin
    ;WITH Customer AS
    (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY AccountCode ORDER BY (select null) ) AS [Version]
        FROM dbo.Customer
    )
    DELETE top(4000) -- choose a lower batch size than 5000 to prevent lock escalation
    FROM    Customer
    WHERE   [Version] > 1;

    -- a short batch means no duplicates are left
    if @@ROWCOUNT < 4000
        BREAK;
end
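To try the batching loop without touching a real table, it can be run against a throwaway temp table. This is a self-contained sketch: #Customer, the Dupes CTE name, the @BatchSize variable, and the sample values are all made up for illustration, and the batch size is tiny only so that the loop goes around more than once.

```sql
-- Hypothetical demo table, not part of the original schema.
CREATE TABLE #Customer
(
    ID          int IDENTITY(1,1) PRIMARY KEY,
    AccountCode int NOT NULL
);

-- 10 rows, 4 distinct AccountCode values, so 6 rows are duplicates.
INSERT INTO #Customer (AccountCode)
VALUES (1),(1),(1),(2),(2),(3),(4),(4),(4),(4);

DECLARE @BatchSize int = 2;  -- deliberately tiny for the demo

WHILE 1 = 1
BEGIN
    WITH Dupes AS
    (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY AccountCode ORDER BY (SELECT NULL)) AS [Version]
        FROM #Customer
    )
    DELETE TOP (@BatchSize)
    FROM Dupes
    WHERE [Version] > 1;

    -- A short (or empty) batch means no duplicates remain.
    IF @@ROWCOUNT < @BatchSize
        BREAK;
END;

-- Each AccountCode should now appear exactly once.
SELECT AccountCode, COUNT(*) AS RowsLeft
FROM #Customer
GROUP BY AccountCode
ORDER BY AccountCode;

DROP TABLE #Customer;
```

Holding the batch size in a variable keeps the TOP value and the @@ROWCOUNT check in sync. Each iteration recomputes the row numbers over the whole table, so on a large table the loop trades extra scans for smaller transactions.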