Mar*_*lli 3 sql-server delete duplication sql-server-2014
I need to delete duplicate rows from a large table. What is the best way to achieve this?
Currently I use this algorithm:
declare @t table ([key] int )
insert into @t select 1
insert into @t select 1
insert into @t select 1
insert into @t select 2
insert into @t select 2
insert into @t select 3
insert into @t select 4
insert into @t select 4
insert into @t select 4
insert into @t select 4
insert into @t select 4
insert into @t select 5
insert into @t select 5
insert into @t select 5
insert into @t select 5
insert into @t select 5
insert into @t select 6
insert into @t select 6
insert into @t select 6
insert into @t select 7
insert into @t select 7
insert into @t select 8
insert into @t select 8
insert into @t select 9
insert into @t select 9
insert into @t select 9
insert into @t select 9
insert into @t select 9
select * from @t
; with cte as (
select *
, row_number() over (partition by [Key] order by [Key]) as Picker
from @t
)
delete cte
where Picker > 1
select * from @t
This is what I run on my own system:
;WITH Customer AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY AccountCode ORDER BY AccountCode ) AS [Version]
FROM Stage.Customer
)
DELETE
FROM Customer
WHERE [Version] <> 1
I found that <> 1 works better than > 1.
I could create this index, which does not currently exist:
USE [BodenDWH]
GO
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [Stage].[Customer] ([AccountCode])
INCLUDE ([ID])
GO
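Is there any other way to get this done?
In this case the table is not large: about 500,000 records on the live system.
The delete is part of an SSIS package that runs daily and removes about 10-15 records each day.
There is a problem with how the data is structured: I need exactly one AccountCode per customer, but duplicates can occur, and if they are not removed they break the package at a later stage.
I did not develop this package, and redesigning anything is outside my scope.
I am just looking for the best way to get rid of the duplicates as quickly as possible, using T-SQL code only, without index creation or anything else.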
If the table is small and the number of rows you are deleting is small, then use
;WITH Customer AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY AccountCode ORDER BY (select null) ) AS [Version]
FROM dbo.Customer
)
DELETE
FROM Customer
WHERE [Version] > 1;
Note: the query above uses an arbitrary ordering in the window ORDER BY clause, ORDER BY (select null) (learned from Itzik Ben-Gan's T-SQL Querying book; @AaronBertrand also mentions it above).
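If it matters which row of each duplicate group survives, the arbitrary ordering can be swapped for a deterministic one. A minimal sketch, assuming the table has the ID column that appears in the question's index definition and that the row with the lowest ID per AccountCode is the one to keep (both are assumptions, not something the original poster confirmed):

```sql
-- Assumption: dbo.Customer has an ID column, and the lowest ID
-- within each AccountCode is the row that should be kept.
;WITH Customer AS
(
    SELECT ID,
           AccountCode,
           ROW_NUMBER() OVER (PARTITION BY AccountCode ORDER BY ID) AS [Version]
    FROM dbo.Customer
)
DELETE
FROM Customer
WHERE [Version] > 1;  -- every row after the first in its group is removed
```

With ORDER BY (select null) the surviving row is whichever one the engine happens to number first, which is fine when the duplicates are truly identical.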
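If the table is large (e.g. 5M records), then deleting in small batches or chunks helps to avoid bloating the transaction log and prevents lock escalation.
Lock escalation on a table is triggered once a single Transact-SQL statement has acquired at least 5,000 locks on a single reference of that table (lock memory thresholds can trigger it as well).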
while 1=1
begin
WITH Customer AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY AccountCode ORDER BY (select null) ) AS [Version]
FROM dbo.Customer
)
DELETE top(4000) -- choose a lower batch size than 5000 to prevent lock escalation
FROM Customer
WHERE [Version] > 1
if @@ROWCOUNT < 4000
BREAK ;
end
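Once the delete has run, batched or not, a quick check can confirm that no duplicate AccountCode values remain. A minimal sketch, reusing the dbo.Customer table and AccountCode column from the queries above; the DuplicateCount alias is only illustrative:

```sql
-- Lists every AccountCode that still occurs more than once.
SELECT AccountCode,
       COUNT(*) AS DuplicateCount  -- illustrative alias, not from the original post
FROM dbo.Customer
GROUP BY AccountCode
HAVING COUNT(*) > 1;
```

An empty result set means every AccountCode now appears exactly once.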