ala*_*lan 1 arrays google-bigquery
I want to concatenate arrays across rows and then do a distinct count. Ideally, this would work:
WITH test AS
(
SELECT
DATE('2018-01-01') as date,
2 as value,
[1,2,3] as key
UNION ALL
SELECT
DATE('2018-01-02') as date,
3 as value,
[1,4,5] as key
)
SELECT
SUM(value) as total_value,
ARRAY_LENGTH(ARRAY_CONCAT_AGG(DISTINCT key)) as unique_key_count
FROM test
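Unfortunately, the ARRAY_CONCAT_AGG function does not support the DISTINCT operator. I can unnest the array, but then the rows fan out (one output row per array element) and the sum of the value column is wrong: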
WITH test AS
(
SELECT
DATE('2018-01-01') as date,
2 as value,
[1,2,3] as key
UNION ALL
SELECT
DATE('2018-01-02') as date,
3 as value,
[1,4,5] as key
)
SELECT
SUM(value) as total_value,
COUNT(DISTINCT k) as unique_key_count
FROM test
CROSS JOIN UNNEST(key) k
Is there something I'm missing that would let me avoid joining against the unnested array?
Here is an alternative:
CREATE TEMP FUNCTION DistinctCount(arr ANY TYPE) AS (
(SELECT COUNT(DISTINCT x) FROM UNNEST(arr) AS x)
);
WITH test AS
(
SELECT
DATE('2018-01-01') as date,
2 as value,
[1,2,3] as key
UNION ALL
SELECT
DATE('2018-01-02') as date,
3 as value,
[1,4,5] as key
)
SELECT
SUM(value) as total_value,
DistinctCount(ARRAY_CONCAT_AGG(key)) as unique_key_count
FROM test
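This avoids a subquery in the main query, and it avoids joining the array with the table (which would duplicate values in the sum).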
Below is for BigQuery Standard SQL
#standardSQL
WITH test AS
(
SELECT DATE('2018-01-01') AS DATE, 2 AS value, [1,2,3] AS key UNION ALL
SELECT DATE('2018-01-02') AS DATE, 3 AS value, [1,4,5] AS key
)
SELECT
total_value,
COUNT(DISTINCT key) unique_key_count
FROM (
SELECT
SUM(value) AS total_value,
ARRAY_CONCAT_AGG(key) AS all_keys
FROM test
), UNNEST(all_keys) key
GROUP BY total_value
Result:
Row total_value unique_key_count
1 5 5
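The same aggregate-first idea can also be written without GROUP BY total_value. The following is only a sketch: the CTE name agg and the alias k are hypothetical names introduced here for illustration. It aggregates once, then counts distinct elements with a scalar subquery over the concatenated array, so the sum is computed before any unnesting happens.

```sql
#standardSQL
WITH test AS
(
  SELECT DATE('2018-01-01') AS date, 2 AS value, [1,2,3] AS key UNION ALL
  SELECT DATE('2018-01-02') AS date, 3 AS value, [1,4,5] AS key
),
agg AS  -- hypothetical CTE name
(
  -- Single output row: the sum is taken before unnesting, so there is no fan-out.
  SELECT
    SUM(value) AS total_value,
    ARRAY_CONCAT_AGG(key) AS all_keys
  FROM test
)
SELECT
  total_value,
  -- Scalar subquery over the already-aggregated array.
  (SELECT COUNT(DISTINCT k) FROM UNNEST(all_keys) AS k) AS unique_key_count
FROM agg
```

On the sample data this should give the same single row as above (total_value 5, unique_key_count 5).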
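If the table has a fairly large number of rows, you can easily run into memory/resource problems. In that case, you can try approximate aggregation with the HyperLogLog++ functions; see the example below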
#standardSQL
WITH test AS
(
SELECT DATE('2018-01-01') AS DATE, 2 AS value, [1,2,3] AS key UNION ALL
SELECT DATE('2018-01-02') AS DATE, 3 AS value, [1,4,5] AS key
)
SELECT
SUM(value) total_value,
HLL_COUNT.MERGE((SELECT HLL_COUNT.INIT(key) FROM UNNEST(key) key)) AS unique_key_count
FROM test
with the result
Row total_value unique_key_count
1 5 5
Note: this is an approximate aggregation, so pay attention to the precision parameter of HLL_COUNT.INIT(input [, precision])
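As a rough illustration of that parameter, the sketch below passes an explicit precision to HLL_COUNT.INIT. To my understanding the accepted range is 10 to 24 with a default of 15, and higher values trade memory for accuracy; please check the current BigQuery documentation before relying on those numbers. The alias k is a hypothetical name used only here.

```sql
#standardSQL
WITH test AS
(
  SELECT DATE('2018-01-01') AS date, 2 AS value, [1,2,3] AS key UNION ALL
  SELECT DATE('2018-01-02') AS date, 3 AS value, [1,4,5] AS key
)
SELECT
  SUM(value) AS total_value,
  -- Per row: build one sketch from the array elements at precision 24 (assumed maximum).
  -- Across rows: HLL_COUNT.MERGE combines the sketches into one cardinality estimate.
  HLL_COUNT.MERGE((SELECT HLL_COUNT.INIT(k, 24) FROM UNNEST(key) AS k)) AS unique_key_count
FROM test
```

The sum is unaffected because the unnesting happens inside the per-row scalar subquery, not in a join.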