使用文本数据类型时删除重复记录

bd3*_*d33 6 sql-server-2005

我有一个存储电子邮件的大表(约 678,000 行),我需要删除与电子邮件字段匹配的重复记录:收件人、发件人、主题、正文以及外键 record_id。

通常我会使用以下语句来删除重复项:

DELETE
FROM emails
WHERE email_id NOT IN (
SELECT MAX(email_id) FROM emails
GROUP BY record_id, from_add, to_add, subject, body)
Run Code Online (Sandbox Code Playgroud)

但是, body 是 datatype text,因此无法比较/分组。它在选择查询上给出以下错误消息:

The text, ntext, and image data types cannot be compared or sorted, except when using IS NULL or LIKE operator.
Run Code Online (Sandbox Code Playgroud)

text数据类型识别重复项的最佳方法是什么?我是否必须varchar先将列转换为?

Mar*_*ith 5

你不具备convert实际列。只是它在SELECT查询中的用法。

对于一次性清理任务,我可能会去

;WITH cte
     AS (SELECT ROW_NUMBER() OVER (PARTITION BY record_id, 
                                                from_add, 
                                                to_add, 
                                                subject, 
                                                CAST(body AS VARCHAR(MAX)) 
                                       ORDER BY email_id DESC) RN
         FROM   emails)
DELETE FROM cte
WHERE  RN > 1 
Run Code Online (Sandbox Code Playgroud)

除非处理大量数据,这可能会使花时间在更有效的解决方案上(不必对巨大的字符串进行排序)值得。

或者你可以尝试

;WITH T
     AS (SELECT MAX(email_id) AS max_email_id
         FROM   emails
         GROUP  BY record_id,
                   from_add,
                   to_add,
                   subject,
                   CAST(body AS VARCHAR(max)))
DELETE e1
FROM   emails e1
WHERE  NOT EXISTS (SELECT *
                   FROM   T
                   WHERE  max_email_id = e1.email_id) 
Run Code Online (Sandbox Code Playgroud)

这可能会为您提供比NOT INSQL Server 视为MAX(not_nullable_column)可能为空的更好的计划,因此您最终会得到一个可能更昂贵的计划

计划