Distinct count across BigQuery arrays

ala*_*lan 1 arrays google-bigquery

I want to concatenate arrays across rows and then do a distinct count. Ideally, this would work:

WITH test AS
(
  SELECT
  DATE('2018-01-01') as date,
  2 as value,
  [1,2,3] as key
  UNION ALL
  SELECT
  DATE('2018-01-02') as date,
  3 as value,
  [1,4,5] as key
)
SELECT
  SUM(value) as total_value,
  ARRAY_LENGTH(ARRAY_CONCAT_AGG(DISTINCT key)) as unique_key_count
FROM test

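Unfortunately, the ARRAY_CONCAT_AGG function does not support the DISTINCT operator. I can unnest the array, but then I get a fan-out: each row is repeated once per array element, so the sum of the value column is wrong: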

WITH test AS
(
  SELECT
  DATE('2018-01-01') as date,
  2 as value,
  [1,2,3] as key
  UNION ALL
  SELECT
  DATE('2018-01-02') as date,
  3 as value,
  [1,4,5] as key
)

SELECT
  SUM(value) as total_value,
  COUNT(DISTINCT k) as unique_key_count

FROM test
  CROSS JOIN UNNEST(key) k

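To see the fan-out concretely, the sketch below lists the rows that the CROSS JOIN produces for the same sample data. Each source row appears once per array element, so value is repeated three times per date, and SUM(value) over these rows comes to 2*3 + 3*3 = 15 rather than 5. The one-line SELECTs and the AS k alias are only a compact restatement of the query above, not a different technique.

```sql
WITH test AS
(
  SELECT DATE('2018-01-01') AS date, 2 AS value, [1,2,3] AS key
  UNION ALL
  SELECT DATE('2018-01-02') AS date, 3 AS value, [1,4,5] AS key
)
-- One output row per array element: value is repeated for every key.
SELECT
  date,
  value,
  k
FROM test
  CROSS JOIN UNNEST(key) AS k
ORDER BY date, k
```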

Is there something I'm missing that would let me avoid joining against the unnested array?

Ell*_*ard 5

Here is an alternative:

CREATE TEMP FUNCTION DistinctCount(arr ANY TYPE) AS (
  (SELECT COUNT(DISTINCT x) FROM UNNEST(arr) AS x)
);

WITH test AS
(
  SELECT
  DATE('2018-01-01') as date,
  2 as value,
  [1,2,3] as key
  UNION ALL
  SELECT
  DATE('2018-01-02') as date,
  3 as value,
  [1,4,5] as key
)

SELECT
  SUM(value) as total_value,
  DistinctCount(ARRAY_CONCAT_AGG(key)) as unique_key_count
FROM test

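This avoids a subquery in the main query and avoids joining the array against the table, which is what duplicates values in the sum.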
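
Because the function is applied to the result of an aggregate, the same pattern should also work per group. The sketch below assumes you want one row per month; the third sample row (February) and the month alias are hypothetical additions for illustration and are not part of the question's data.

```sql
CREATE TEMP FUNCTION DistinctCount(arr ANY TYPE) AS (
  (SELECT COUNT(DISTINCT x) FROM UNNEST(arr) AS x)
);

WITH test AS
(
  SELECT DATE('2018-01-01') AS date, 2 AS value, [1,2,3] AS key
  UNION ALL
  SELECT DATE('2018-01-02') AS date, 3 AS value, [1,4,5] AS key
  UNION ALL
  -- Hypothetical extra row so that there is more than one group.
  SELECT DATE('2018-02-01') AS date, 4 AS value, [1,2] AS key
)
SELECT
  DATE_TRUNC(date, MONTH) AS month,
  SUM(value) AS total_value,
  -- Arrays are concatenated within each group, then de-duplicated by the function.
  DistinctCount(ARRAY_CONCAT_AGG(key)) AS unique_key_count
FROM test
GROUP BY month
ORDER BY month
```

With this sample data, January has a total_value of 5 with 5 distinct keys, and February has a total_value of 4 with 2 distinct keys.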


Mik*_*ant 5

Below is for BigQuery Standard SQL:

#standardSQL
WITH test AS
(
  SELECT DATE('2018-01-01') AS DATE, 2 AS value, [1,2,3] AS key UNION ALL
  SELECT DATE('2018-01-02') AS DATE, 3 AS value, [1,4,5] AS key
)
SELECT 
  total_value,
  COUNT(DISTINCT key) unique_key_count
FROM (
  SELECT
    SUM(value) AS total_value,
    ARRAY_CONCAT_AGG(key) AS all_keys
  FROM test
), UNNEST(all_keys) key
GROUP BY total_value  

Result:

Row total_value unique_key_count     
1   5           5     

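If the table has a large number of rows, you can easily run into memory/resource issues, because the query above concatenates every array into one. In that case, you can try approximate aggregation with the HyperLogLog++ functions; see the example below: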

#standardSQL
WITH test AS
(
  SELECT DATE('2018-01-01') AS DATE, 2 AS value, [1,2,3] AS key UNION ALL
  SELECT DATE('2018-01-02') AS DATE, 3 AS value, [1,4,5] AS key
)
SELECT
  SUM(value) total_value,
  HLL_COUNT.MERGE((SELECT HLL_COUNT.INIT(key) FROM UNNEST(key) key)) AS unique_key_count
FROM test

Result:

Row total_value unique_key_count     
1   5           5

Note: this is approximate aggregation, so pay attention to the precision parameter of HLL_COUNT.INIT(input [, precision]).
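
As a minimal sketch of setting that parameter, the query below passes an explicit precision as the second argument. The value 20 is an arbitrary choice for illustration; per the BigQuery documentation the accepted range is 10 to 24 with a default of 15, and higher values trade larger sketches for a tighter estimate.

```sql
#standardSQL
WITH test AS
(
  SELECT DATE('2018-01-01') AS date, 2 AS value, [1,2,3] AS key UNION ALL
  SELECT DATE('2018-01-02') AS date, 3 AS value, [1,4,5] AS key
)
SELECT
  SUM(value) AS total_value,
  -- Build one sketch per row from its array, then merge the sketches across rows.
  -- The second argument to HLL_COUNT.INIT is the optional precision (20 is illustrative).
  HLL_COUNT.MERGE((SELECT HLL_COUNT.INIT(k, 20) FROM UNNEST(key) AS k)) AS unique_key_count
FROM test
```

Because each row's array is turned into a sketch before the aggregation, no concatenated array of all keys has to be held in memory.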