Nei*_*eil 15 sql sql-delete amazon-redshift
I am trying to delete some duplicate data from my Redshift table.
Here is my query:
With duplicates
As
(Select *, ROW_NUMBER() Over (PARTITION by record_indicator Order by record_indicator) as Duplicate From table_name)
delete from duplicates
Where Duplicate > 1 ;
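This query gives me an error:
Amazon Invalid operation: syntax error at or near "delete";
I'm not sure what the problem is, since the syntax of the WITH clause looks correct. Has anyone run into this before?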
sys*_*ack 19
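Redshift being what it is (no enforced uniqueness on any column), Ziggy's third option is probably the best. Once we decide to go the temp table route, it is more efficient to swap the whole table out. Deletes and inserts are expensive in Redshift.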
begin;
create table table_name_new as select distinct * from table_name;
alter table table_name rename to table_name_old;
alter table table_name_new rename to table_name;
drop table table_name_old;
commit;
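If space isn't an issue, you can keep the old table around for a while and use the other methods described here to verify that the row count of the original, after accounting for duplicates, matches the row count of the new table.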
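One way to run that check, sketched under the assumption that the `drop table table_name_old` step was skipped so both tables still exist. The column aliases are made up for illustration.

```sql
-- Assumes table_name_old was kept rather than dropped.
-- The two counts should be equal if the swap removed only exact duplicates.
select
    (select count(*) from (select distinct * from table_name_old) as d) as distinct_rows_in_old,
    (select count(*) from table_name) as rows_in_new;
```

If the numbers differ, something other than exact duplicates changed between the two tables, for example a load that ran during the swap.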
If you are running continuous loads into such a table, you will want to pause them while this is going on.
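If the duplicates make up only a small percentage of a large table, you might instead copy one distinct record for each duplicate into a temp table, then delete all records from the original that join to the temp table, and finally append the temp table back to the original. Make sure you VACUUM the original table afterwards (which you should be doing on a schedule for large tables anyway).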
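A minimal sketch of that sequence. `dup_key` is a hypothetical column standing in for whatever identifies a duplicate in your table, and `dup_rows` is an illustrative temp table name; neither comes from the answer.

```sql
-- Sketch only: dup_key and dup_rows are hypothetical names.
begin;

-- One copy of every row whose key occurs more than once
create temp table dup_rows as
select distinct *
from table_name
where dup_key in (
    select dup_key
    from table_name
    group by dup_key
    having count(*) > 1
);

-- Remove every copy of those rows from the original
delete from table_name
where dup_key in (select dup_key from dup_rows);

-- Append the single copies back
insert into table_name
select * from dup_rows;

commit;

-- VACUUM cannot run inside a transaction block, so it goes after the commit
vacuum table_name;
analyze table_name;
```

Note that `select distinct *` only collapses rows that are identical in every column. Rows that share `dup_key` but differ elsewhere would all be re-inserted.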
Ell*_*nce 13
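If you are dealing with a lot of data, rebuilding the whole table is not always possible or smart. It may be easier to locate and delete just those rows: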
-- Open a transaction so the delete and re-insert below are applied atomically
BEGIN;
-- First identify all the rows that are duplicate
CREATE TEMP TABLE duplicate_saleids AS
SELECT saleid
FROM sales
WHERE saledateid BETWEEN 2224 AND 2231
GROUP BY saleid
HAVING COUNT(*) > 1;
-- Extract one copy of all the duplicate rows
CREATE TEMP TABLE new_sales(LIKE sales);
INSERT INTO new_sales
SELECT DISTINCT *
FROM sales
WHERE saledateid BETWEEN 2224 AND 2231
AND saleid IN(
SELECT saleid
FROM duplicate_saleids
);
-- Remove all rows that were duplicated (all copies).
DELETE FROM sales
WHERE saledateid BETWEEN 2224 AND 2231
AND saleid IN(
SELECT saleid
FROM duplicate_saleids
);
-- Insert back in the single copies
INSERT INTO sales
SELECT *
FROM new_sales;
-- Cleanup
DROP TABLE duplicate_saleids;
DROP TABLE new_sales;
COMMIT;
Full article: https://elliot.land/post/removing-duplicate-data-in-redshift
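Once the steps above have run, a quick check can confirm the range is clean. This sketch reuses the `sales`, `saleid`, and `saledateid` names from the answer; the `copies` alias is made up. It should return no rows provided the duplicates were exact copies of each other, since `SELECT DISTINCT *` only collapses rows that match in every column.

```sql
-- Any saleid listed here still has more than one row in the range.
SELECT saleid, COUNT(*) AS copies
FROM sales
WHERE saledateid BETWEEN 2224 AND 2231
GROUP BY saleid
HAVING COUNT(*) > 1;
```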
Jai*_*Jai 10
Copy the distinct rows of original_table into a new table:
CREATE TABLE unique_table as
(
SELECT DISTINCT * FROM original_table
)
;
Back up original_table:
CREATE TABLE backup_table as
(
SELECT * FROM original_table
)
;
Truncate original_table:
TRUNCATE original_table;
Insert the rows from unique_table into original_table:
INSERT INTO original_table
(
SELECT * FROM unique_table
)
;
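-- The same steps in one transaction. DELETE replaces TRUNCATE here
-- because TRUNCATE commits immediately in Redshift.
BEGIN transaction;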
CREATE TABLE unique_table as
(
SELECT DISTINCT * FROM original_table
)
;
CREATE TABLE backup_table as
(
SELECT * FROM original_table
)
;
DELETE FROM original_table;
INSERT INTO original_table
(
SELECT * FROM unique_table
)
;
END transaction;
That should have worked. Alternatively, you can do:
With
duplicates As (
Select *, ROW_NUMBER() Over (PARTITION by record_indicator
Order by record_indicator) as Duplicate
From table_name)
delete from table_name
where id in (select id from duplicates Where Duplicate > 1);
or
delete from table_name
where id in (
select id
from (
Select id, ROW_NUMBER() Over (PARTITION by record_indicator
Order by record_indicator) as Duplicate
From table_name) x
Where Duplicate > 1);
If you don't have a primary key, you can do the following:
BEGIN;
CREATE TEMP TABLE mydups ON COMMIT DROP AS
SELECT DISTINCT ON (record_indicator) *
FROM table_name
ORDER BY record_indicator --, other_optional_priority_field DESC
;
DELETE FROM table_name
WHERE record_indicator IN (
SELECT record_indicator FROM mydups);
INSERT INTO table_name SELECT * FROM mydups;
COMMIT;
小智 6
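A simple answer to this question:
First, create a temporary table from the main table containing the rows where row_number=1.
Then delete all rows in the main table that have duplicates.
Finally, insert the rows from the temporary table back into the main table.
Queries:
Temporary table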
select id,date into #temp_a
from
(select *
from (select a.*,
row_number() over(partition by id order by etl_createdon desc) as rn
from table a
where a.id between 59 and 75 and a.date = '2018-05-24')
where rn =1)a
Delete all the rows from the main table.
delete from table a
where a.id between 59 and 75 and a.date = '2018-05-24'
Insert all the values from the temporary table into the main table
insert into table a select * from #temp_a;