Fill a table with data for missing dates (PostgreSQL, Redshift)

D.D*_*glo 6 postgresql amazon-web-services amazon-redshift gaps-in-data

I'm trying to fill in daily data for missing dates and can't find an answer. Please help.

An example of my daily_table:

      url          | timestamp_gmt | visitors | hits  | other.. 
-------------------+---------------+----------+-------+-------
 www.domain.com/1  | 2016-04-12    |   1231   | 23423 |
 www.domain.com/1  | 2016-04-13    |   1374   | 26482 |
 www.domain.com/1  | 2016-04-17    |   1262   | 21493 |
 www.domain.com/2  | 2016-05-09    |   2345   | 35471 |          

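Expected result: I want to fill this table with a row for every domain and every day, where each added row simply copies the data from the previous available date: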

      url          | timestamp_gmt | visitors | hits  | other.. 
-------------------+---------------+----------+-------+-------
 www.domain.com/1  | 2016-04-12    |   1231   | 23423 |
 www.domain.com/1  | 2016-04-13    |   1374   | 26482 |
 www.domain.com/1  | 2016-04-14    |   1374   | 26482 |     <-added
 www.domain.com/1  | 2016-04-15    |   1374   | 26482 |     <-added
 www.domain.com/1  | 2016-04-16    |   1374   | 26482 |     <-added
 www.domain.com/1  | 2016-04-17    |   1262   | 21493 |
 www.domain.com/2  | 2016-05-09    |   2345   | 35471 |          

I could move part of the logic into PHP, but that is undesirable because my table has billions of missing dates.

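SUMMARY:

Over the last few days I found out that:

  1. Amazon Redshift is based on PostgreSQL 8, which is why it doesn't support nice commands such as JOIN LATERAL
  2. Redshift also doesn't support generate_series, and I first thought it didn't support CTEs either
  3. It does support a simple WITH (thanks, @systemjack), but not WITH RECURSIVE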
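
For comparison, on a stock PostgreSQL server (9.3 or later, where LATERAL is available) the whole task fits in one query. This is only a sketch, assuming timestamp_gmt is a DATE column and that visitors and hits are the columns to carry forward. The aliases r, d and p are illustrative, and the query will not run on Redshift for the reasons listed above.

```sql
SELECT r.url,
       d.day::date AS timestamp_gmt,
       p.visitors,
       p.hits
FROM (
  -- first and last known date per URL
  SELECT url,
         MIN(timestamp_gmt) AS min_date,
         MAX(timestamp_gmt) AS max_date
  FROM daily_table
  GROUP BY url
) AS r
-- one row per calendar day between the first and last date of that URL
CROSS JOIN LATERAL generate_series(r.min_date::timestamp,
                                   r.max_date::timestamp,
                                   interval '1 day') AS d(day)
-- the most recent existing row on or before that day
CROSS JOIN LATERAL (
  SELECT t.visitors, t.hits
  FROM daily_table AS t
  WHERE t.url = r.url
    AND t.timestamp_gmt <= d.day::date
  ORDER BY t.timestamp_gmt DESC
  LIMIT 1
) AS p
ORDER BY r.url, d.day;
```

Days that already have a row simply pick up their own values, so the result is the complete, gap-free series for each URL.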

D.D*_*glo 6

I finally finished my task, and I want to share a few things that turned out to be useful.

Instead of generate_series I used this trick:

WITH date_range AS (
  SELECT trunc(current_date - (row_number() OVER ())) AS date
  FROM any_table  -- any table of yours that has at least 365 rows
  LIMIT 365
) SELECT * FROM date_range;
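
If no table with enough rows is at hand, the same list of days can probably be built from a small digits CTE instead. This is a sketch under the assumption that plain (non-recursive) WITH and date-minus-integer arithmetic are available, as they are in the snippet above. The names digits and numbers are illustrative.

```sql
WITH digits AS (
  SELECT 0 AS d UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3
  UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6
  UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9
),
numbers AS (
  -- every integer from 0 to 999
  SELECT a.d + 10 * b.d + 100 * c.d AS n
  FROM digits AS a
  CROSS JOIN digits AS b
  CROSS JOIN digits AS c
),
date_range AS (
  -- the 365 days before today, matching row_number() values 1..365
  SELECT current_date - n AS date
  FROM numbers
  WHERE n BETWEEN 1 AND 365
)
SELECT * FROM date_range ORDER BY date;
```

Widening the BETWEEN range (up to 999 with three digit columns) reaches further back, which matters when the gaps to fill are more than a year old.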

To get the list of URLs whose gaps have to be filled, I used the following:

WITH url_list AS (
  SELECT
    url AS gapsed_url,
    MIN(timestamp_gmt) AS min_date,
    MAX(timestamp_gmt) AS max_date
  FROM daily_table
  WHERE url IN (
    SELECT url FROM daily_table GROUP BY url
    HAVING count(url) < (MAX(timestamp_gmt) - MIN(timestamp_gmt) + 1)
  )
  GROUP BY url
) SELECT * FROM url_list;

Then I combined the two result sets. Let's call the result url_mapping:

SELECT t1.*, t2.gapsed_url FROM date_range AS t1 CROSS JOIN url_list AS t2
WHERE t1.date <= t2.max_date AND t1.date >= t2.min_date;

To fetch, for each day, the data from the closest earlier (or same) date, I did the following:

SELECT um.date AS filled_date, sd.*  -- um.date is the day being filled; sd.* is the row copied into it
FROM url_mapping AS um JOIN daily_table AS sd
ON um.gapsed_url = sd.url AND (
  sd.timestamp_gmt = (SELECT max(timestamp_gmt) FROM daily_table WHERE url = sd.url AND timestamp_gmt <= um.date)
)
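
The fragments above can be chained into a single statement. This is a minimal sketch, assuming timestamp_gmt is a DATE column and any_table has at least 365 rows. The output alias filled_date is my own addition, so that each generated row carries the day being filled and not the date of its source row. The url_list step applies HAVING directly to the grouped query, which should select the same URLs as the IN subquery.

```sql
WITH date_range AS (
  -- the last 365 days, counted back from yesterday
  SELECT trunc(current_date - (row_number() OVER ())) AS date
  FROM any_table
  LIMIT 365
),
url_list AS (
  -- URLs with fewer rows than days between their first and last date
  SELECT url AS gapsed_url,
         MIN(timestamp_gmt) AS min_date,
         MAX(timestamp_gmt) AS max_date
  FROM daily_table
  GROUP BY url
  HAVING count(url) < (MAX(timestamp_gmt) - MIN(timestamp_gmt) + 1)
),
url_mapping AS (
  -- every (day, URL) pair inside that URL's own date range
  SELECT t1.date, t2.gapsed_url
  FROM date_range AS t1
  CROSS JOIN url_list AS t2
  WHERE t1.date BETWEEN t2.min_date AND t2.max_date
)
SELECT um.gapsed_url AS url,
       um.date       AS filled_date,
       sd.visitors,
       sd.hits
FROM url_mapping AS um
JOIN daily_table AS sd
  ON um.gapsed_url = sd.url
 AND sd.timestamp_gmt = (
       SELECT max(timestamp_gmt)
       FROM daily_table
       WHERE url = sd.url
         AND timestamp_gmt <= um.date
     );
```

The result also contains the days that already exist in daily_table (they match themselves), so those rows would need to be filtered out before inserting the result back into the table. Because date_range only reaches 365 days back from today, older gaps are covered only if the LIMIT is raised accordingly.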

I hope this helps someone.