SQL Server - 从子查询/派生表中删除

Jua*_*lez 3 sql-server

有没有办法可以把下面的SELECT语句变成一个DELETE

I would like to delete the corresponding returned records from the [ETL].[Stage_Claims] table.

Since I used derived tables I can't reference the Stage_Claims table.

To summarize, the 2 physical tables used in the below query have identical structures. Only difference is DUPS_Claims is a subset of Stage_Claims.

DUPS_Claims contains duplicate records found in Stage_Claims. If a record exists 3 times in Stage_Claims, we will have that record 3 times in DUPS_Claims as well.

Stage_Claims contains all records including the duplicate records in DUPS_Claims.

I would like to remove the duplicate records from Stage_Claims leaving only 1 unique record for every duplicated record.

Stage_Claims has just short of 1 million rows so I do not want to use Row_Number / Partition on the entire table as it takes over 2 minutes to run.

The below query I have runs in about 15 seconds and successfully identifies only the duplicate records (not including the original unique record we want to keep) but I have not been able to figure out how to delete the records that are returned from SC.

Is it possible or should I just take a different approach?

SELECT *
FROM (
    SELECT RN = ROW_NUMBER() OVER (
            PARTITION BY SC.ID ORDER BY SC.id
            )
        ,SC.*
    FROM [ETL].[Stage_Claims] SC
    WHERE ID IN (
            SELECT ID
            FROM (
                SELECT RN = ROW_NUMBER() OVER (
                        PARTITION BY ID ORDER BY id
                        )
                    ,ID
                FROM [ETL].[DUPS_Claims]
                ) AS t1
            WHERE RN > 1
            )
    ) AS t2
WHERE RN > 1
Run Code Online (Sandbox Code Playgroud)

Han*_*non 12

Convert your select statement into a CTE, and DELETE FROM the CTE, as in:

;WITH del AS
(
SELECT *
FROM (
    SELECT RN = ROW_NUMBER() OVER (
            PARTITION BY SC.ID ORDER BY SC.id
            )
        ,SC.*
    FROM [ETL].[Stage_Claims] SC
    WHERE ID IN (
            SELECT ID
            FROM (
                SELECT RN = ROW_NUMBER() OVER (
                        PARTITION BY ID ORDER BY id
                        )
                    ,ID
                FROM [ETL].[DUPS_Claims]
                ) AS t1
            WHERE RN > 1
            )
    ) AS t2
WHERE RN > 1
)
DELETE FROM del;
Run Code Online (Sandbox Code Playgroud)

Standard warning: You should test this in a non-production environment.

You can simplify your query quite a bit, and likely get better performance by using the below query, which does not make use of the intermediate table, DUPS_Claims, since it is absolutely unnecessary:

;WITH cte AS
(
    SELECT sc.ID
        , rn = ROW_NUMBER() OVER (PARTITION BY sc.ID ORDER BY sc.ID)
    FROM ETL.Stage_Claims sc
)
DELETE
FROM cte 
WHERE rn > 1;
Run Code Online (Sandbox Code Playgroud)

I created a non-clustered, non-unique index on both tables, then looked at the execution plans for both variations.

The first variant:

在此处输入图片说明

The second variant:

在此处输入图片说明

第一个变体扫描索引两次,而第二个变体显然只需要扫描一次索引,并且在我有些人为的示例中不需要相对昂贵的合并连接。我的示例ETL.Stage_Claims表包含 89 个唯一ID值,每个值重复 89 次,总共 7921 行。

如果 CTE 不是你的东西,你可以使用这种方法从派生表中删除,而不是:

DELETE c
FROM (
    SELECT sc.ID
        , rn = ROW_NUMBER() OVER (PARTITION BY sc.ID ORDER BY sc.ID)
    FROM ETL.Stage_Claims sc
) c
WHERE rn > 1;
Run Code Online (Sandbox Code Playgroud)

以上DELETE来自派生表的查询计划:

在此处输入图片说明