Efficient way to compare two tables in BigQuery

Nic*_*ers 2 sql ansi-sql google-bigquery bigquery-standard-sql

I am interested in checking whether two tables contain the same data.

I could do it like this:

#standardSQL
SELECT
    key1, key2
FROM
(
    SELECT 
    table1.key1,
    table1.key2,
    table1.column1 - table2.column1 as col1,
    table1.col2 - table2.col2 as col2
    FROM
        `table1` AS table1
    LEFT JOIN
        `table2` AS table2
    ON
        table1.key1 = table2.key1
    AND
        table1.key2 = table2.key2
)
WHERE 
    col1 != 0
OR
    col2 != 0

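However, this gets harder when I want to compare all numeric columns, especially when I want to do it for several combinations of tables.

Hence my question: is anyone aware of a way to iterate over all numeric columns and restrict the result set to those keys for which any of the differences is not zero?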

Jor*_*eno 24

In standard SQL, we found that a UNION ALL of two EXCEPT DISTINCT queries works for our use case:

(
  SELECT * FROM table1
  EXCEPT DISTINCT
  SELECT * from table2
)

UNION ALL

(
  SELECT * FROM table2
  EXCEPT DISTINCT
  SELECT * from table1
)

This produces the differences in both directions:

  • rows in table1 that are not in table2
  • rows in table2 that are not in table1

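Notes and caveats:

  • table1 and table2 must have the same width, with columns in the same order and of the same types.
  • This does not work for STRUCT or ARRAY data types. You should either UNNEST them or convert them first with TO_JSON_STRING.
  • This also does not work directly for GEOGRAPHY; you must convert it first with ST_AsText.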
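The last two caveats can be handled inline with SELECT * REPLACE. The following is only a sketch under stated assumptions: the columns tags (ARRAY), attrs (STRUCT) and location (GEOGRAPHY) are hypothetical and do not come from the question.

```sql
#standardSQL
-- Hypothetical dummy tables; tags, attrs and location are assumed columns
-- used only to illustrate the conversion step.
WITH `table1` AS (
  SELECT 1 AS id, ['a', 'b'] AS tags, STRUCT(1 AS x, 'u' AS y) AS attrs,
         ST_GEOGPOINT(0, 0) AS location
), `table2` AS (
  SELECT 1 AS id, ['a', 'c'] AS tags, STRUCT(1 AS x, 'u' AS y) AS attrs,
         ST_GEOGPOINT(0, 0) AS location
), t1 AS (
  -- Serialize the columns that EXCEPT DISTINCT cannot compare directly
  SELECT * REPLACE (
    TO_JSON_STRING(tags) AS tags,
    TO_JSON_STRING(attrs) AS attrs,
    ST_ASTEXT(location) AS location)
  FROM `table1`
), t2 AS (
  SELECT * REPLACE (
    TO_JSON_STRING(tags) AS tags,
    TO_JSON_STRING(attrs) AS attrs,
    ST_ASTEXT(location) AS location)
  FROM `table2`
)
SELECT * FROM (SELECT * FROM t1 EXCEPT DISTINCT SELECT * FROM t2)
UNION ALL
SELECT * FROM (SELECT * FROM t2 EXCEPT DISTINCT SELECT * FROM t1)
```

Because the arrays are compared as serialized strings, two arrays with the same elements in a different order count as different.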

  • I modified this to `SELECT 'table1' AS Tbl, * FROM (SELECT ... EXCEPT DISTINCT ...) UNION ALL SELECT 'table2' AS Tbl, * FROM (SELECT ... EXCEPT DISTINCT ...)`, which also includes the source table in the result. By the way, I like this approach overall. (8 upvotes)
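Spelled out in full, the variant from this comment could look like the sketch below. The two small tables are made-up dummy data so that the query can be run as is; replace the WITH clause with real table references.

```sql
#standardSQL
-- Made-up dummy data for illustration
WITH `table1` AS (
  SELECT 1 key1, 1 key2, 1 column1, 2 col2 UNION ALL
  SELECT 2, 2, 3, 4 UNION ALL
  SELECT 3, 3, 5, 6
), `table2` AS (
  SELECT 1 key1, 1 key2, 1 column1, 29 col2 UNION ALL
  SELECT 2, 2, 3, 4 UNION ALL
  SELECT 4, 4, 7, 8
)
-- Rows present in table1 but not in table2, tagged with their source
SELECT 'table1' AS Tbl, * FROM (
  SELECT * FROM `table1`
  EXCEPT DISTINCT
  SELECT * FROM `table2`
)
UNION ALL
-- Rows present in table2 but not in table1, tagged with their source
SELECT 'table2' AS Tbl, * FROM (
  SELECT * FROM `table2`
  EXCEPT DISTINCT
  SELECT * FROM `table1`
)
-- Expected rows (in no guaranteed order):
--   table1, 1, 1, 1, 2
--   table1, 3, 3, 5, 6
--   table2, 1, 1, 1, 29
--   table2, 4, 4, 7, 8
```

The row with keys (2, 2) is identical in both tables and therefore does not appear.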

Mik*_*ant 6

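First, I want to point out the issues with your original query.

The main issues are 1) the use of LEFT JOIN; 2) the use of col != 0

Below is how it should be modified to really capture all differences between the two tables.
Run your original query and then the one below; hopefully you will see the difference.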

#standardSQL
SELECT key1, key2
FROM
(
    SELECT 
    IFNULL(table1.key1, table2.key1) key1,
    IFNULL(table1.key2, table2.key2) key2,
    table1.column1 - table2.column1 AS col1,
    table1.col2 - table2.col2 AS col2
    FROM `table1` AS table1
    FULL OUTER JOIN `table2` AS table2
    ON table1.key1 = table2.key1
    AND table1.key2 = table2.key2
)
WHERE IFNULL(col1, 1) != 0
OR    IFNULL(col2, 1) != 0

Or you can run both the original and the revised version against dummy data to see the difference:

#standardSQL
WITH `table1` AS (
  SELECT 1 key1, 1 key2, 1 column1, 2 col2 UNION ALL
  SELECT 2, 2, 3, 4 UNION ALL
  SELECT 3, 3, 5, 6
), `table2` AS (
  SELECT 1 key1, 1 key2, 1 column1, 29 col2 UNION ALL
  SELECT 2, 2, 3, 4 UNION ALL
  SELECT 4, 4, 7, 8
)
SELECT key1, key2
FROM
(
    SELECT 
    IFNULL(table1.key1, table2.key1) key1,
    IFNULL(table1.key2, table2.key2) key2,
    table1.column1 - table2.column1 AS col1,
    table1.col2 - table2.col2 AS col2
    FROM `table1` AS table1
    FULL OUTER JOIN `table2` AS table2
    ON table1.key1 = table2.key1
    AND table1.key2 = table2.key2
)
WHERE IFNULL(col1, 1) != 0
OR    IFNULL(col2, 1) != 0   

Second, the following greatly simplifies your overall query:
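For comparison, here is the original query from the question run against the same dummy data. With the LEFT JOIN, the row that exists only in table2 (keys 4, 4) never enters the result. For the row that exists only in table1 (keys 3, 3), both differences are NULL, and NULL != 0 evaluates to NULL rather than TRUE, so the WHERE clause drops it.

```sql
#standardSQL
WITH `table1` AS (
  SELECT 1 key1, 1 key2, 1 column1, 2 col2 UNION ALL
  SELECT 2, 2, 3, 4 UNION ALL
  SELECT 3, 3, 5, 6
), `table2` AS (
  SELECT 1 key1, 1 key2, 1 column1, 29 col2 UNION ALL
  SELECT 2, 2, 3, 4 UNION ALL
  SELECT 4, 4, 7, 8
)
SELECT key1, key2
FROM
(
    SELECT
    table1.key1,
    table1.key2,
    table1.column1 - table2.column1 AS col1,
    table1.col2 - table2.col2 AS col2
    FROM `table1` AS table1
    LEFT JOIN `table2` AS table2   -- rows that exist only in table2 are lost here
    ON table1.key1 = table2.key1
    AND table1.key2 = table2.key2
)
WHERE col1 != 0   -- NULL != 0 is NULL, so unmatched table1 rows are filtered out
OR    col2 != 0
-- Returns only (1, 1); the FULL OUTER JOIN / IFNULL version
-- returns (1, 1), (3, 3) and (4, 4).
```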

#standardSQL
SELECT 
  IFNULL(table1.key1, table2.key1) key1,
  IFNULL(table1.key2, table2.key2) key2
FROM `table1` AS table1
FULL OUTER JOIN `table2` AS table2
ON table1.key1 = table2.key1
AND table1.key2 = table2.key2
WHERE TO_JSON_STRING(table1) != TO_JSON_STRING(table2)  
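
You can test it with the same dummy data as above.
Note: with this solution you do not need to pick specific columns; it simply compares all of them. But if you only need to compare specific columns, you still have to cherry-pick them, as in the example below: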

#standardSQL
SELECT 
  IFNULL(table1.key1, table2.key1) key1,
  IFNULL(table1.key2, table2.key2) key2
FROM `table1` AS table1
FULL OUTER JOIN `table2` AS table2
ON table1.key1 = table2.key1
AND table1.key2 = table2.key2
WHERE TO_JSON_STRING((table1.column1, table1.col2)) != TO_JSON_STRING((table2.column1, table2.col2))
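For convenience, here is the whole-row comparison glued together with the dummy data from the earlier example, so it can be pasted and run directly. It adds nothing new beyond combining the two pieces.

```sql
#standardSQL
WITH `table1` AS (
  SELECT 1 key1, 1 key2, 1 column1, 2 col2 UNION ALL
  SELECT 2, 2, 3, 4 UNION ALL
  SELECT 3, 3, 5, 6
), `table2` AS (
  SELECT 1 key1, 1 key2, 1 column1, 29 col2 UNION ALL
  SELECT 2, 2, 3, 4 UNION ALL
  SELECT 4, 4, 7, 8
)
SELECT
  IFNULL(table1.key1, table2.key1) key1,
  IFNULL(table1.key2, table2.key2) key2
FROM `table1` AS table1
FULL OUTER JOIN `table2` AS table2
ON table1.key1 = table2.key1
AND table1.key2 = table2.key2
-- Each row is serialized to one JSON string, so every column is compared at once
WHERE TO_JSON_STRING(table1) != TO_JSON_STRING(table2)
```

Keys (1, 1) differ in col2 and should be reported, while (2, 2) is identical and should not. How the unmatched keys (3, 3) and (4, 4) are handled depends on how TO_JSON_STRING renders the missing side of the join, so it is worth confirming that on your own data.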

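  • I found a Medium post on this, [BigQuery Table Comparison](https://medium.com/google-cloud/bigquery-table-comparison-cea802a3c64d), which discusses full-table comparison using hashes as well as comparison by key. The full-table comparison is similar to the solutions given here. (2 upvotes)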
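As a rough idea of what a hash-based full-table check can look like, here is a minimal sketch. It is an assumption of mine, not necessarily the method of the linked post: each row is fingerprinted with FARM_FINGERPRINT over its JSON serialization, and the fingerprints are combined with BIT_XOR so that row order does not matter.

```sql
#standardSQL
-- Made-up dummy data for illustration
WITH `table1` AS (
  SELECT 1 key1, 1 key2, 1 column1, 2 col2 UNION ALL
  SELECT 2, 2, 3, 4
), `table2` AS (
  SELECT 1 key1, 1 key2, 1 column1, 29 col2 UNION ALL
  SELECT 2, 2, 3, 4
), h1 AS (
  -- One fingerprint per row, XOR-ed into a single order-independent value
  SELECT COUNT(*) AS row_count,
         BIT_XOR(FARM_FINGERPRINT(TO_JSON_STRING(t))) AS fingerprint
  FROM `table1` AS t
), h2 AS (
  SELECT COUNT(*) AS row_count,
         BIT_XOR(FARM_FINGERPRINT(TO_JSON_STRING(t))) AS fingerprint
  FROM `table2` AS t
)
SELECT
  -- NULL if either table is empty, because BIT_XOR over no rows is NULL
  h1.row_count = h2.row_count AND h1.fingerprint = h2.fingerprint AS tables_match
FROM h1 CROSS JOIN h2
```

This only answers whether the tables match, not which rows differ. Two caveats: pairs of identical duplicate rows cancel each other out under XOR, which is why the row count is compared as well, and hash collisions are possible, so a match is strong evidence rather than proof.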