BigQuery: Removing duplicates from a partitioned table

Sha*_*vim 6 google-bigquery bigquery-standard-sql

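I have a BQ table that is partitioned by insert time, and I'm trying to remove duplicates from it. These are true duplicates: for any 2 duplicate rows, all columns are equal. Of course, having a unique key might have helped :-(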

At first I tried a SELECT query that enumerates the duplicates and removes them:

SELECT
    * EXCEPT(row_number)
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY id_column) AS row_number
    FROM
        `mytable`)
WHERE
    row_number = 1

This results in unique rows, but it creates a new table that doesn't include the partition data, so that's no good.

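I've seen this answer here, which states that the only way to retain the partitions is to go over them one by one with the query above and save each result to the specific target table partition.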

What I'd really like to do is use a DML DELETE to remove the duplicate rows in place. I tried something similar to what this answer suggests:

DELETE
FROM `mytable` AS d
WHERE (SELECT ROW_NUMBER() OVER (PARTITION BY id_column)
   FROM `mytable` AS d2
   WHERE d.id_column = d2.id_column) > 1;

But the accepted answer doesn't work and results in a BQ error:

Error: Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN

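It would be great if anyone could offer a simpler way (DML or otherwise) to deal with this, so that I don't have to loop over all the partitions individually.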

Ell*_*ard 7

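It's a bit of a hack, but you can use the MERGE statement to delete all of the contents of the table and atomically reinsert only the distinct rows. Here's an example: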

-- Create a table with some duplicate rows
CREATE TABLE dataset.PartitionedTable
PARTITION BY date AS
SELECT x, CONCAT('foo', CAST(x AS STRING)) AS y, DATE_SUB(CURRENT_DATE(), INTERVAL x DAY) AS date
FROM UNNEST(GENERATE_ARRAY(1, 10)) AS x, UNNEST(GENERATE_ARRAY(1, 10));
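To see the duplicates that the CREATE TABLE statement produces, one option is to group by every column and count the copies. This is only a quick sketch that reuses the table and column names from the example:

```sql
-- List every row that occurs more than once, with its number of copies.
-- The cross join with a second 10-element array means each distinct
-- (x, y, date) row was written 10 times.
SELECT x, y, date, COUNT(*) AS copies
FROM dataset.PartitionedTable
GROUP BY x, y, date
HAVING COUNT(*) > 1
ORDER BY x;
```

Running the same query again after the MERGE should return no rows if the deduplication worked.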

Now for the MERGE:

-- Execute a MERGE statement where all original rows are deleted,
-- then replaced with new, deduplicated rows:
MERGE dataset.PartitionedTable AS t1
USING (SELECT DISTINCT * FROM dataset.PartitionedTable) AS t2
ON FALSE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW
WHEN NOT MATCHED BY SOURCE THEN DELETE
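The example table is partitioned on a regular `date` column, whereas the table in the question is partitioned by insert time. `SELECT DISTINCT *` does not return the `_PARTITIONTIME` pseudo-column, so `INSERT ROW` on its own would not carry each row's original partition over. A possible adaptation is sketched below. It is not part of the original answer and is untested here; the table name `dataset.IngestionTable` and the columns `col1` and `col2` are hypothetical placeholders, and it assumes that a DML insert may set `_PARTITIONTIME` explicitly:

```sql
-- Assumption: dataset.IngestionTable is a hypothetical ingestion-time
-- partitioned table whose only columns are col1 and col2.
MERGE dataset.IngestionTable AS t1
USING (
  -- Expose the partition pseudo-column as pt so that identical rows are
  -- collapsed within a partition but kept apart across partitions.
  SELECT DISTINCT *, _PARTITIONTIME AS pt
  FROM dataset.IngestionTable
) AS t2
ON FALSE
WHEN NOT MATCHED BY TARGET THEN
  -- Reinsert each distinct row into the partition it came from.
  INSERT (_PARTITIONTIME, col1, col2) VALUES (pt, col1, col2)
WHEN NOT MATCHED BY SOURCE THEN
  -- ON FALSE means no target row ever matches, so every original row is deleted.
  DELETE;
```

With this shape the column list has to be spelled out by hand, since `INSERT ROW` cannot be combined with an explicit `_PARTITIONTIME` value. Also note that `SELECT DISTINCT` is not allowed on STRUCT or ARRAY columns, so tables with such columns would need a different way to pick one copy per row.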