我有一个存储电子邮件的大表(约 678,000 行),我需要删除与电子邮件字段匹配的重复记录:收件人、发件人、主题、正文以及外键 record_id。
通常我会使用以下语句来删除重复项:
DELETE
FROM emails
WHERE email_id NOT IN (
SELECT MAX(email_id) FROM emails
GROUP BY record_id, from_add, to_add, subject, body)
Run Code Online (Sandbox Code Playgroud)
但是, body 是 datatype text,因此无法比较/分组。它在选择查询上给出以下错误消息:
The text, ntext, and image data types cannot be compared or sorted, except when using IS NULL or LIKE operator.
Run Code Online (Sandbox Code Playgroud)
用text数据类型识别重复项的最佳方法是什么?我是否必须varchar先将列转换为?
你不具备对convert实际列。只是它在SELECT查询中的用法。
对于一次性清理任务,我可能会去
;WITH cte
AS (SELECT ROW_NUMBER() OVER (PARTITION BY record_id,
from_add,
to_add,
subject,
CAST(body AS VARCHAR(MAX))
ORDER BY email_id DESC) RN
FROM emails)
DELETE FROM cte
WHERE RN > 1
Run Code Online (Sandbox Code Playgroud)
除非处理大量数据,这可能会使花时间在更有效的解决方案上(不必对巨大的字符串进行排序)值得。
或者你可以尝试
;WITH T
AS (SELECT MAX(email_id) AS max_email_id
FROM emails
GROUP BY record_id,
from_add,
to_add,
subject,
CAST(body AS VARCHAR(max)))
DELETE e1
FROM emails e1
WHERE NOT EXISTS (SELECT *
FROM T
WHERE max_email_id = e1.email_id)
Run Code Online (Sandbox Code Playgroud)
这可能会为您提供比NOT INSQL Server 视为MAX(not_nullable_column)可能为空的更好的计划,因此您最终会得到一个可能更昂贵的计划。
| 归档时间: |
|
| 查看次数: |
2298 次 |
| 最近记录: |