Deleting duplicate rows from a large (> 100 Mio) PostgreSQL table (truncate with a condition?)

Question

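As in the linked question, I have a large table that stores all events in our system, and for one event type I have duplicate rows (they were mistakenly exported several times from another system). I need to delete them to clean up the statistics. The solution proposed there is to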

insert the records (without duplicates) into a temporary table,
truncate the original table and insert them back.
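
For reference, the two steps above can be sketched roughly as follows. This is only an illustration: the table name events and the helper table events_keep are hypothetical, and SELECT DISTINCT * assumes every column type supports equality comparison.

```sql
-- Hypothetical names: events (original table), events_keep (temporary copy).
BEGIN;

-- Step 1: copy the rows without duplicates into a temporary table.
CREATE TEMP TABLE events_keep AS
SELECT DISTINCT * FROM events;

-- Step 2: empty the original table and insert the de-duplicated rows back.
-- TRUNCATE is transactional in PostgreSQL, so a failure rolls everything back.
TRUNCATE events;
INSERT INTO events SELECT * FROM events_keep;

DROP TABLE events_keep;
COMMIT;
```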

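But in my case I only need to delete rows of one event type, not all rows, which is not possible with truncate. I am wondering whether I could benefit from the Postgres USING syntax, as in this SO answer, which offers the following solution: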

DELETE FROM user_accounts
USING user_accounts ua2
WHERE user_accounts.email = ua2.email AND user_accounts.id < ua2.id;


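The problem is that I have no id field in this large table. So what is the fastest approach in this case? Is DELETE plus INSERT from a temporary table the only option?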

Answer 1

小智 5

You can use the ctid column as a "replacement id":

DELETE FROM user_accounts
USING user_accounts ua2
WHERE user_accounts.email = ua2.email
  AND user_accounts.ctid < ua2.ctid;


Although this raises another question: why does your user_accounts table have no primary key?
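
Applied to the situation in the question (only one event type is affected), the ctid approach might look like the sketch below. The table events, its columns event_type, source_id and created_at, and the value 'imported' are all assumptions made for illustration; substitute whatever columns actually define a duplicate.

```sql
-- Hypothetical table and columns: events(event_type, source_id, created_at, ...)
DELETE FROM events e
USING events e2
WHERE e.event_type  = 'imported'        -- restrict to the affected event type (assumed value)
  AND e2.event_type = e.event_type
  AND e.source_id   = e2.source_id      -- columns that define a "duplicate" (assumed)
  AND e.created_at  = e2.created_at
  AND e.ctid < e2.ctid;                 -- of each duplicate group, keep the row with the highest ctid
```

Note that = never matches NULL values, so rows with NULL in one of the compared columns would not be treated as duplicates here; IS NOT DISTINCT FROM could be used instead if that matters. Also, ctid changes when a row is updated or the table is rewritten (for example by VACUUM FULL), so it should only be relied on within a single statement like this one.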

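However, if you delete a large part of the rows in the table, the delete will never be very efficient (and the comparison on ctid is not fast either, because it is not indexed). So the delete will most probably take a very long time.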
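
To judge whether the delete would touch a large part of the table, it may help to count the surplus rows first. The query below is a sketch using the same hypothetical events table and duplicate-defining columns (source_id, created_at); it has to scan the affected rows, so it is not cheap on a table of this size either.

```sql
-- Hypothetical table and columns: events(event_type, source_id, created_at, ...)
-- For every group of duplicates, all rows except one would be deleted.
SELECT sum(n - 1) AS rows_to_delete
FROM (
    SELECT count(*) AS n
    FROM events
    WHERE event_type = 'imported'       -- assumed value for the affected event type
    GROUP BY source_id, created_at      -- assumed duplicate-defining columns
    HAVING count(*) > 1
) AS dup;
```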

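For a one-time operation where you need to delete many rows, inserting the rows you want to keep into an intermediate table is much faster.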

That method can be improved by simply keeping the intermediate table instead of copying the rows back to the original table.

-- run everything in one transaction so the swap is atomic
begin;

-- this will create the same table including indexes and not null constraints
-- but NOT foreign key constraints!
create table temp (like user_accounts including all);

insert into temp
select distinct ... -- this is your query that removes the duplicates
from user_accounts;

-- you might need cascade if the table is referenced by others
drop table user_accounts;

alter table temp rename to user_accounts;

commit;

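
For the case in the question, where only one event type contains duplicates, the query that fills the intermediate table could copy all other rows unchanged and de-duplicate only the affected type. A minimal sketch, assuming a hypothetical table events with an event_type column, the assumed value 'imported', and column types that all support equality (required by SELECT DISTINCT *):

```sql
-- Hypothetical names: events (original table), events_dedup (intermediate table).
BEGIN;

CREATE TABLE events_dedup (LIKE events INCLUDING ALL);

INSERT INTO events_dedup
SELECT * FROM events
WHERE event_type IS DISTINCT FROM 'imported'   -- all other rows, including NULL types, copied as-is
UNION ALL
SELECT DISTINCT * FROM events
WHERE event_type = 'imported';                 -- only the affected type is de-duplicated

DROP TABLE events;                             -- add CASCADE if other tables reference it
ALTER TABLE events_dedup RENAME TO events;

COMMIT;
```

SELECT * lines up with the new table because LIKE copies the columns in their original order.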

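The only drawback is that you have to re-create the foreign keys for the original table (both the FKs that reference the original table and the FKs from the original table to other tables).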
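
One way to make re-creating the foreign keys less error-prone is to record their definitions from the system catalog before dropping the table. The catalog query below uses pg_constraint and pg_get_constraintdef; the ALTER TABLE statement afterwards is purely illustrative, with a hypothetical countries table, country_code column and constraint name.

```sql
-- Run BEFORE the drop: list every foreign key that touches the table,
-- both outgoing (conrelid) and incoming (confrelid).
SELECT conrelid::regclass        AS constrained_table,
       conname                   AS constraint_name,
       pg_get_constraintdef(oid) AS definition
FROM pg_constraint
WHERE contype = 'f'
  AND (conrelid  = 'user_accounts'::regclass
       OR confrelid = 'user_accounts'::regclass);

-- Run AFTER the rename: re-add each constraint from its saved definition.
-- Hypothetical example of an outgoing foreign key:
ALTER TABLE user_accounts
  ADD CONSTRAINT user_accounts_country_code_fkey
  FOREIGN KEY (country_code) REFERENCES countries (code);
```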

Archived:	11 years, 7 months ago
Views:	2127
Last activity:	10 years, 2 months ago
