Fastest technique for deleting duplicate data

Question

O.O*_*O.O 7 sql sql-server etl sql-server-2008

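After searching stackoverflow.com, I found several questions asking how to remove duplicates, but none of them addressed speed.

In my case, I have a table with 10 columns that contains 5 million exact row duplicates. In addition, I have at least a million rows that are duplicated in 9 of the 10 columns. My current technique has taken (so far) 3 hours to delete these 5 million rows. Here is my process: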

-- Step 1:  **This step took 13 minutes.** Insert only one of the n duplicate rows into a temp table
select
    MAX(prikey) as MaxPriKey, -- identity(1, 1)
    a,
    b,
    c,
    d,
    e,
    f,
    g,
    h,
    i
into #dupTemp
FROM sourceTable
group by
    a,
    b,
    c,
    d,
    e,
    f,
    g,
    h,
    i
having COUNT(*) > 1

Next,

-- Step 2: **This step is taking the 3+ hours**
-- delete the row when all the non-unique columns are the same (duplicates) and
-- have a smaller prikey not equal to the max prikey
-- Note: the doubled FROM is valid T-SQL. The first FROM names the table
-- to delete from; the second introduces the join that selects the rows.
delete
from sourceTable
from sourceTable
inner join #dupTemp on
    sourceTable.a = #dupTemp.a and
    sourceTable.b = #dupTemp.b and
    sourceTable.c = #dupTemp.c and
    sourceTable.d = #dupTemp.d and
    sourceTable.e = #dupTemp.e and
    sourceTable.f = #dupTemp.f and
    sourceTable.g = #dupTemp.g and
    sourceTable.h = #dupTemp.h and
    sourceTable.i = #dupTemp.i and
    sourceTable.PriKey != #dupTemp.MaxPriKey

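Any tips on how to speed this up, or on a faster approach? Keep in mind that I will have to run this again for the rows that are not exact duplicates.

Thanks so much.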

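UPDATE:
I had to stop step 2 at the 9 hour mark. I tried OMG Ponies' method and it finished after only 40 minutes. I also tried my step 2 with Andomar's batch delete, and it ran for 9 hours before I stopped it. UPDATE: I ran a similar query with one fewer field to remove a different set of duplicates, and using OMG Ponies' method the query ran for only 4 minutes (8000 rows).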
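
Andomar's answer is not reproduced on this page, so the following is only a rough sketch of what a batched delete generally looks like in T-SQL, not that answer's actual code. It assumes the `sourceTable` and `#dupTemp` objects from steps 1 and 2 above; the batch size of 10000 is an arbitrary, hypothetical value. Deleting in chunks keeps each transaction small, though as noted above it did not help in this case.

```sql
-- Sketch only: batch size is an assumed value, tune it for your system.
DECLARE @batch INT = 10000;

WHILE 1 = 1
BEGIN
    -- Delete at most @batch duplicate rows per iteration.
    DELETE TOP (@batch) s
    FROM sourceTable AS s
    INNER JOIN #dupTemp AS dt ON
        s.a = dt.a AND
        s.b = dt.b AND
        s.c = dt.c AND
        s.d = dt.d AND
        s.e = dt.e AND
        s.f = dt.f AND
        s.g = dt.g AND
        s.h = dt.h AND
        s.i = dt.i AND
        s.PriKey <> dt.MaxPriKey;

    -- Stop once an iteration finds nothing left to delete.
    IF @@ROWCOUNT = 0 BREAK;
END
```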

I will try the CTE technique at the next opportunity; however, I suspect OMG Ponies' method will be hard to beat.
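
The CTE technique mentioned here is not shown on this page. A minimal sketch of the common `ROW_NUMBER()` form, assuming the same `sourceTable`, its `PriKey` identity column, and columns `a` through `i`, might look like the following. The CTE name `ranked` and the alias `rn` are made up for illustration. Like step 1, it keeps the row with the highest `PriKey` in each group, but it needs no temp table.

```sql
-- Sketch only: number the rows within each group of identical (a..i)
-- values, highest PriKey first, then delete everything after the first.
WITH ranked AS (
    SELECT
        ROW_NUMBER() OVER (
            PARTITION BY a, b, c, d, e, f, g, h, i
            ORDER BY PriKey DESC
        ) AS rn
    FROM sourceTable
)
DELETE FROM ranked
WHERE rn > 1;
```

Whether this beats the EXISTS approach depends on the data and indexes; no timing for it is reported in the question.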

Answer 1

OMG*_*ies 5

How about EXISTS:

DELETE FROM sourceTable
 WHERE EXISTS(SELECT NULL
                FROM #dupTemp dt
               WHERE sourceTable.a = dt.a 
                 AND sourceTable.b = dt.b 
                 AND sourceTable.c = dt.c 
                 AND sourceTable.d = dt.d 
                 AND sourceTable.e = dt.e 
                 AND sourceTable.f = dt.f 
                 AND sourceTable.g = dt.g 
                 AND sourceTable.h = dt.h 
                 AND sourceTable.i = dt.i 
                 AND sourceTable.PriKey < dt.MaxPriKey)
