What is the fastest way to delete duplicate rows?

Question


Mar*_*lli 3 sql-server delete duplication sql-server-2014

I need to delete duplicate rows from a large table. What is the best way to achieve this?

Currently I use this algorithm:

declare @t table ([key] int  )

insert into @t select 1
insert into @t select 1
insert into @t select 1
insert into @t select 2
insert into @t select 2
insert into @t select 3
insert into @t select 4
insert into @t select 4
insert into @t select 4
insert into @t select 4
insert into @t select 4
insert into @t select 5
insert into @t select 5
insert into @t select 5
insert into @t select 5
insert into @t select 5
insert into @t select 6
insert into @t select 6
insert into @t select 6
insert into @t select 7
insert into @t select 7
insert into @t select 8
insert into @t select 8
insert into @t select 9
insert into @t select 9
insert into @t select 9
insert into @t select 9
insert into @t select 9


select * from @t

; with cte as (
    -- number the rows within each [key] group; every row after the first is a duplicate
    select *
        , row_number() over (partition by [key] order by [key]) as Picker
    from @t
    )
delete cte
where Picker > 1

select * from @t


When I run it on my system:

;WITH Customer AS
    (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY AccountCode ORDER BY AccountCode ) AS [Version]
    FROM Stage.Customer
    )
    DELETE
    FROM    Customer
    WHERE   [Version] <> 1


I found that <> 1 works better than > 1.

I could create this index, which does not currently exist:

USE [BodenDWH]
GO
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [Stage].[Customer] ([AccountCode])
INCLUDE ([ID])
GO


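Is there any other way to get this done?

In this case the table is not large: about 500,000 records on the live system.

The delete is part of an SSIS package that runs daily and removes about 10-15 records each day.

There is a problem with how the data is structured: I need just one AccountCode per customer, but duplicates can occur, and if they are not removed they break the package at a later stage.

I did not develop the package, and redesigning anything is outside my scope.

I am just looking for the best way to get rid of the duplicates as quickly as possible, without going into index creation or anything else, just T-SQL code.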

Answer 1

Kin*_*hah 5

If the table is small and the number of rows you are deleting is small, then use

;WITH Customer AS
    (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY AccountCode ORDER BY (select null) ) AS [Version]
    FROM dbo.Customer
    )
    DELETE
    FROM    Customer
    WHERE   [Version] > 1;


Note: the query above uses an arbitrary ordering, ORDER BY (select null), in the window order clause (learned from Itzik Ben-Gan's T-SQL Querying book; @AaronBertrand also referred to it above).
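With an arbitrary ordering, which row of each duplicate group survives is not defined. If that matters, the window order can be made deterministic. A minimal sketch, assuming the table has an ID column (the missing-index suggestion in the question lists one) and that keeping the row with the lowest ID per AccountCode is acceptable; both are assumptions, not something the question confirms:

```sql
;WITH Customer AS
(
    -- ID is assumed to exist and to be unique per row;
    -- the row with the lowest ID in each AccountCode group gets [Version] = 1
    SELECT ID, AccountCode,
           ROW_NUMBER() OVER (PARTITION BY AccountCode ORDER BY ID) AS [Version]
    FROM dbo.Customer
)
DELETE
FROM    Customer
WHERE   [Version] > 1;  -- remove every row except the first in each group
```

Ordering by a real column may cost a sort that ORDER BY (select null) avoids, so this is a trade-off between predictability and speed.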

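If the table is large (e.g. 5M records), then deleting in small batches (chunks) helps avoid bloating the transaction log and prevents lock escalation.

Lock escalation is triggered when a single Transact-SQL statement acquires at least 5,000 locks on a single reference of a table (it can also be triggered by lock memory pressure).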

WHILE 1 = 1
BEGIN
    ;WITH Customer AS
    (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY AccountCode ORDER BY (SELECT NULL)) AS [Version]
        FROM dbo.Customer
    )
    DELETE TOP (4000) -- choose a batch size lower than 5000 to prevent lock escalation
    FROM    Customer
    WHERE   [Version] > 1;

    -- a batch smaller than 4000 rows means no duplicates are left
    IF @@ROWCOUNT < 4000
        BREAK;
END

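Before or after running either delete, a grouped count shows which AccountCode values still have more than one row. This is only a verification sketch against the same dbo.Customer table used above; the Copies alias is made up for illustration:

```sql
-- list every AccountCode that currently appears more than once
SELECT AccountCode, COUNT(*) AS Copies
FROM dbo.Customer
GROUP BY AccountCode
HAVING COUNT(*) > 1;
```

An empty result set means no duplicates remain.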

Asked:	10 years, 1 month ago
Viewed:	4088 times
Active:	9 years, 7 months ago