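I have a large table (about 678,000 rows) that stores emails, and I need to delete duplicate records that match on the email fields: to, from, subject, and body, plus the foreign key record_id.
Normally I would use the following statement to delete duplicates: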
DELETE
FROM emails
WHERE email_id NOT IN (
SELECT MAX(email_id) FROM emails
GROUP BY record_id, from_add, to_add, subject, body)
However, body is of datatype text, so it cannot be compared or grouped. The select query gives the following error message:
The text, ntext, and image data types cannot be compared or sorted, except when using IS NULL or LIKE operator.
What is the best way to identify duplicates with the text data type? Do I have to convert the column to varchar first?
You don't have to convert the actual column. Just cast it where it is used in the SELECT query.
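As a minimal sketch of casting only inside the query, the following read-only check lists the duplicate groups before anything is deleted. It assumes the table and column names from the question; the aliases copies and kept_email_id are made up for illustration. If the column were ntext rather than text, NVARCHAR(MAX) would be the matching cast.

```sql
-- Preview only: nothing is modified and the body column keeps its text type.
-- The cast exists solely inside this query so that GROUP BY can compare the values.
SELECT record_id,
       from_add,
       to_add,
       subject,
       COUNT(*)      AS copies,         -- rows in this duplicate group
       MAX(email_id) AS kept_email_id   -- the row a "keep the newest" delete would retain
FROM emails
GROUP BY record_id,
         from_add,
         to_add,
         subject,
         CAST(body AS VARCHAR(MAX))
HAVING COUNT(*) > 1;
```

Running something like this first gives a rough idea of how many rows a delete would remove.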
For a one-off cleanup task I would probably go for
;WITH cte
AS (SELECT ROW_NUMBER() OVER (PARTITION BY record_id,
from_add,
to_add,
subject,
CAST(body AS VARCHAR(MAX))
ORDER BY email_id DESC) RN
FROM emails)
DELETE FROM cte
WHERE RN > 1
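Unless you are dealing with a huge amount of data, in which case it might be worth spending time on a more efficient solution (one that doesn't have to sort huge strings).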
Or you could try
;WITH T
AS (SELECT MAX(email_id) AS max_email_id
FROM emails
GROUP BY record_id,
from_add,
to_add,
subject,
CAST(body AS VARCHAR(max)))
DELETE e1
FROM emails e1
WHERE NOT EXISTS (SELECT *
FROM T
WHERE max_email_id = e1.email_id)
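This might give you a better plan than NOT IN, because SQL Server regards MAX(not_nullable_column) as potentially nullable, so with NOT IN you can end up with a potentially more expensive plan.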
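If the table is large enough that sorting full body strings becomes the bottleneck, one possible direction is to compare a fixed-size hash of the body instead. This is only a sketch under stated assumptions, not something from the answer above: the temp table #email_hashes and the column body_hash are hypothetical names, SHA2_256 needs SQL Server 2012 or later, and HASHBYTES only accepts inputs longer than 8000 bytes from SQL Server 2016 onwards. It also treats two bodies with the same hash as identical, which is a deliberate shortcut.

```sql
-- Assumption: SQL Server 2016+ so HASHBYTES can hash bodies longer than 8000 bytes.
-- Step 1: compute one 32-byte hash per row so later comparisons avoid the full text.
SELECT email_id,
       record_id,
       from_add,
       to_add,
       subject,
       HASHBYTES('SHA2_256', CAST(body AS VARCHAR(MAX))) AS body_hash
INTO #email_hashes          -- hypothetical temp table
FROM emails;

-- Step 2: rank rows within each group of matching fields and delete all but the newest.
;WITH ranked
     AS (SELECT email_id,
                ROW_NUMBER() OVER (PARTITION BY record_id,
                                                from_add,
                                                to_add,
                                                subject,
                                                body_hash
                                   ORDER BY email_id DESC) AS RN
         FROM #email_hashes)
DELETE e
FROM emails AS e
     INNER JOIN ranked AS r
             ON r.email_id = e.email_id
WHERE r.RN > 1;

DROP TABLE #email_hashes;
```

Whether this actually beats the plain CAST approach depends on the data and the plan, so it is worth comparing both on a copy of the table before relying on it.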