SQL array flattening: why doesn't CROSS JOIN UNNEST join every nested value with every row?

con*_*lee 13 sql google-bigquery

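This question isn't about solving a specific problem. It's about understanding what actually happens behind the scenes in a common SQL idiom for flattening arrays. There is some magic involved, and I'd like to peek behind the curtain of syntactic sugar to see what is going on.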

Let's consider the following table, t1:

[image: table t1]

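Now suppose we have a function called FLATTEN that takes a column of type array and unpacks each array in that column, leaving one row for each value in each array. If we run SELECT FLATTEN(numbers_array) AS flattened_numbers FROM t1, we would expect the following, which we'll call t2: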

[image: table t2]

In SQL, CROSS JOIN combines two tables by pairing each row from the first table with each row from the second. So if we run SELECT id, flattened.flattened_numbers from t1 CROSS JOIN flattened, we get:

[image: query result]

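Now, FLATTEN is just an imaginary function, and you can see that combining it with CROSS JOIN isn't very useful, because each original value of the id column gets mixed with the flattened_numbers from every original row. Everything gets jumbled because there is no WHERE clause to pick out only the rows of the CROSS JOIN that we want.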

The pattern people actually use to flatten arrays looks like this: SELECT id, flattened_numbers FROM t1 CROSS JOIN UNNEST(t1.numbers_array) AS flattened_numbers, which produces:

[image: query result]

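But I don't understand why the CROSS JOIN UNNEST pattern works. Since the CROSS JOIN has no WHERE clause, I would expect it to behave just like the FLATTEN function I outlined above, where every unnested value is combined with every row of t1.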

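Can someone "unpack" what is actually happening in the CROSS JOIN UNNEST pattern that ensures each row is joined only with its own nested values (and not with the nested values from other rows)?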

Ell*_*ard 11

The best way to think about this is by looking at what happens on a row-by-row basis. Setting up some input data, we have:

WITH t1 AS (
  SELECT 1 AS id, [0, 1] AS numbers_array UNION ALL
  SELECT 2, [2, 4, 5]
)
...

(I'm using a third element for the second row to make things more interesting). If we just select from it, we get output that looks like this:

WITH t1 AS (
  SELECT 1 AS id, [0, 1] AS numbers_array UNION ALL
  SELECT 2, [2, 4, 5]
)
SELECT * FROM t1;
+----+---------------+
| id | numbers_array |
+----+---------------+
| 1  | [0, 1]        |
| 2  | [2, 4, 5]     |
+----+---------------+

Now let's talk about unnesting. The UNNEST function takes an array and returns a value table of the array's element type. Whereas most BigQuery tables are SQL tables defined as a collection of columns, a value table has rows of some value type. For numbers_array, UNNEST(numbers_array) returns a value table whose value type is INT64, since numbers_array is an array with an element type of INT64. This value table contains all of the elements in numbers_array for the current row from t1.

For the row with an id of 1, the contents of the value table returned by UNNEST(numbers_array) are:

+-----+
| f0_ |
+-----+
| 0   |
| 1   |
+-----+

This is the same as what we get with the following query:

SELECT * FROM UNNEST([0, 1]);

UNNEST([0, 1]) in this case means "create a value table from the INT64 values 0 and 1".

Similarly, for the row with an id of 2, the contents of the value table returned by UNNEST(numbers_array) are:

+-----+
| f0_ |
+-----+
| 2   |
| 4   |
| 5   |
+-----+

Now let's talk about how CROSS JOIN fits into the picture. In most cases, you use CROSS JOIN between two uncorrelated tables. In other words, the contents of the table on the right of the CROSS JOIN are not defined by the current contents of the table on the left.

In the case of arrays and UNNEST, however, the contents of the value table produced by UNNEST(numbers_array) change depending on the current row of t1. When we join the two tables, we get the cross product of the current row from t1 with all of the rows from UNNEST(numbers_array). For example:

WITH t1 AS (
  SELECT 1 AS id, [0, 1] AS numbers_array UNION ALL
  SELECT 2, [2, 4, 5]
)
SELECT id, number
FROM t1
CROSS JOIN UNNEST(numbers_array) AS number;
+----+--------+
| id | number |
+----+--------+
| 1  | 0      |
| 1  | 1      |
| 2  | 2      |
| 2  | 4      |
| 2  | 5      |
+----+--------+

numbers_array has two elements in the first row and three elements in the second, so we get 2 + 3 = 5 rows in the result of the query.
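
To make the row-by-row behavior concrete, here is a minimal Python sketch of the same idea. It is only a model of the semantics described above, not how BigQuery executes the query; the list t1 and the helper cross_join_unnest are hypothetical names introduced for illustration.

```python
# Hypothetical in-memory stand-in for t1, mirroring the sample data above.
t1 = [
    {"id": 1, "numbers_array": [0, 1]},
    {"id": 2, "numbers_array": [2, 4, 5]},
]


def cross_join_unnest(rows, array_column):
    """Model a correlated join: the right side is rebuilt for each left row."""
    result = []
    for row in rows:  # the current row of t1
        # UNNEST sees only this row's array, never another row's.
        for element in row[array_column]:
            result.append((row["id"], element))
    return result


correlated = cross_join_unnest(t1, "numbers_array")
for row_id, number in correlated:
    print(row_id, number)
```

The inner loop iterates over a different list for each outer row, which is why no WHERE clause is needed to keep each id paired only with its own elements.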

As for how this differs from flattening numbers_array first and then performing a CROSS JOIN, let's look at the results of this query:

WITH t1 AS (
  SELECT 1 AS id, [0, 1] AS numbers_array UNION ALL
  SELECT 2, [2, 4, 5]
), t2 AS (
  SELECT number
  FROM t1
  CROSS JOIN UNNEST(numbers_array) AS number
)
SELECT number
FROM t2;
+--------+
| number |
+--------+
| 0      |
| 1      |
| 2      |
| 4      |
| 5      |
+--------+

In this case, t2 is a SQL table with a column named number containing those values. If we perform a CROSS JOIN between t1 and t2, we get a true cross product of all rows:

WITH t1 AS (
  SELECT 1 AS id, [0, 1] AS numbers_array UNION ALL
  SELECT 2, [2, 4, 5]
), t2 AS (
  SELECT number
  FROM t1
  CROSS JOIN UNNEST(numbers_array) AS number
)
SELECT id, numbers_array, number
FROM t1
CROSS JOIN t2;
+----+---------------+--------+
| id | numbers_array | number |
+----+---------------+--------+
| 1  | [0, 1]        | 0      |
| 1  | [0, 1]        | 1      |
| 1  | [0, 1]        | 2      |
| 1  | [0, 1]        | 4      |
| 1  | [0, 1]        | 5      |
| 2  | [2, 4, 5]     | 0      |
| 2  | [2, 4, 5]     | 1      |
| 2  | [2, 4, 5]     | 2      |
| 2  | [2, 4, 5]     | 4      |
| 2  | [2, 4, 5]     | 5      |
+----+---------------+--------+

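So what is the difference between this query and the previous one with CROSS JOIN UNNEST(numbers_array)? In this case, the contents of t2 do not change for each row from t1. For the first row in t1, there are five rows in t2. For the second row in t1, there are also five rows in t2. As a result, the CROSS JOIN between the two returns 5 + 5 = 10 rows in total.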
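
The uncorrelated case can be sketched in Python in the same spirit. Again, this is only an illustrative model under the assumption that t1 is a small in-memory list; the variable names are hypothetical. Here t2 is built once, before the join, so it is identical for every row of t1.

```python
from itertools import product

# Hypothetical in-memory stand-in for t1, mirroring the sample data above.
t1 = [
    {"id": 1, "numbers_array": [0, 1]},
    {"id": 2, "numbers_array": [2, 4, 5]},
]

# t2 is materialized up front: all elements from all rows, flattened together.
t2 = [number for row in t1 for number in row["numbers_array"]]

# A plain cross product: every row of t1 is paired with every row of t2.
uncorrelated = [(row["id"], number) for row, number in product(t1, t2)]

for row_id, number in uncorrelated:
    print(row_id, number)
```

Because t2 no longer depends on the current row of t1, each of the two rows is paired with all five numbers, giving ten rows instead of five.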

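  • Also, FWIW: I'm the kind of user who would have liked a short explanation of this in the official docs, but I may be in the minority. As much as I like the CROSS JOIN UNNEST idiom, I find it confusing when a good old cross join behaves differently from what I expect for no apparent reason. (4 upvotes)
  • Thanks for the good explanation. I had never heard of a "value table" before, and if I google "SQL value table" I can't find pages that use the term the way you do. Is this an idea specific to BigQuery? (3 upvotes)
  • We're working on it :) I've recently been reviewing changes that overhaul how we explain these concepts. (2 upvotes)
  • Unrelated question, but how do you get the results neatly formatted as a table with the "+, -, |" structure? (2 upvotes)
  • It isn't a near-term thing, but keep an eye on [ZetaSQL](https://github.com/google/zetasql), since I think it will include formatting results this way once the reference implementation is released. (2 upvotes)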