PostgreSQL:插入缺失值

Cap*_*hab 5 postgresql

我在 PostgreSQL 中有一个带有时间戳和值的表。
我想在“lat”下插入缺失的值。

“lat”下的值是基准面以上的潮汐高度。为此,可以在两个已知值之间线性插入缺失值。

在 PostgreSQL 中执行此操作的最佳方法是什么?

数据库摘录

编辑20200825

我使用 QGIS 字段计算器以不同的方式解决了这个问题。这种方法的问题是:它需要很长时间,并且该进程在客户端运行,我想直接在服务器上运行它。

分步骤来说,我的工作流程是:

  1. 记录的“lat”值之间的间隔为 10 分钟。我计算了两个记录值之间每分钟的增量,并将其存储在记录的“lat”值处名为“tidal_step”的额外列中。(我将时间戳也存储为列中的“纪元”)

在QGIS中:

tidal_step =
-- the lat value @ the epoch, 10 minutes or 600000 miliseconds from the current epoch: 
(attribute(get_feature('werkset','epoch',("epoch"+'600000')),'lat') -
-- the lat value @ the current
attribute(get_feature('werkset','epoch',"epoch"),'lat'))
/10
Run Code Online (Sandbox Code Playgroud)

对于示例图像中的前两个值,结果为: (4.95 - 5.07) /10 = -0.012

  1. 我确定了要插值的“lat”值的分钟数,超过了记录“lat”值的最后记录实例并将其存储在列中:“min_past_rec”

在QGIS中:

left(
right("timestamp",8) --this takes the timestamp and goes 8 charakters from the right
,1) -- this takes the string from the previous right( and goes 1 character left
Run Code Online (Sandbox Code Playgroud)

对于示例中的第一个值: 2019-01-01 00:15:15 返回: '5' 这是上次记录值过去 5 分钟的时间。

  1. 我通过将(“min_past_rec”*“tidal_step”)添加到最后记录的“lat”值来插入缺失值,并将其存储在名为“lat_interpolated”的列中

在QGIS中

CASE
WHEN "lat" = NULL 
THEN
-- minutes pas the last recorded instance:
("min_past_rec" *
-- the "tidal_step" at the last recorded "lat"-value:
(attribute(get_feature('werkset','epoch',
("epoch" - --the epoch of the "lat" value to be interpolated minus:

left(right("timestamp",8),1) * 600000 -- = the amount of minutes after the last recorded instance.
+ left(right("timestamp",6),2) * 1000) --  and the amount of seconds after the last recorded instance.
),'tidal_step')) +

-- the last recorded "lat"-value

(attribute(get_feature('werkset','epoch',("epoch" - left(right("timestamp",8),1) * 600000 + left(right("timestamp",6),2) * 1000)),'lat'))
Run Code Online (Sandbox Code Playgroud)

使用示例中的数据:

2019-01-01 00:17:33:

"lat_interpolated" = "min_past_rec" * "tidal_step" + "lat" = 
7*-0.012 + 4.95 = 4.866
Run Code Online (Sandbox Code Playgroud)
  1. 从数据库中删除过时的列

我应该在 PostgreSQL 中使用哪些语句/脚本来执行相同的任务?

Vér*_*ace 3

我有一个(部分)解决方案 - 我所做的如下(请参阅此处可用的小提琴):

我用于插值的算法是

  • 如果存在序列 1 NULL,则取上方值和下方值的平均值。

  • 一个 2 秒的序列NULL,顶部分配的值是其上方两条记录的平均值,底部分配的值是其下方两条记录的平均值。

为了做到这一点,我做了以下事情:

创建一个表:

CREATE TABLE data 
(
  s SERIAL PRIMARY KEY, 
  t TIMESTAMP, 
  lat NUMERIC
);
Run Code Online (Sandbox Code Playgroud)

用一些示例数据填充它:

INSERT INTO data (t, lat)
VALUES
('2019-01-01 00:00:00', 5.07),
('2019-01-01 01:00:00', 4.60),
('2019-01-01 02:00:00', NULL),
('2019-01-01 03:00:00', NULL),
('2019-01-01 04:00:00', 4.7),
('2019-01-01 05:00:00', 4.20),
('2019-01-01 06:00:00', NULL),
('2019-01-01 07:00:00', 4.98),
('2019-01-01 08:00:00', 4.50);
Run Code Online (Sandbox Code Playgroud)

请注意,记录 3、4 和 7 是NULL

然后我运行了第一个查询:

WITH cte1 AS
(
  SELECT d1.s,
    d1.t AS t1, d1.lat AS l1,
    LAG(d1.lat, 2)  OVER (ORDER BY t ASC) AS lag_t1_2,
    LAG(d1.lat, 1)  OVER (ORDER BY t ASC) AS lag_t1,
    LEAD(d1.lat, 1) OVER (ORDER BY t ASC) AS lead_t1,
    LEAD(d1.lat, 2) OVER (ORDER BY t ASC) AS lead_t1_2
  FROM data d1
),
cte2 AS
(
  SELECT 
    d2.t AS t2, d2.lat AS l2, 
    LAG(d2.lat, 1) OVER(ORDER BY t DESC) AS lag_t2,
    LEAD(d2.lat, 1) OVER(ORDER BY t DESC) AS lead_t2
  FROM data d2
),
cte3 AS
(
  SELECT t1.s,
    t1.t1,  t1.lag_t1_2, t1.lag_t1, t2.lag_t2, t1.l1, t2.l2, 
    t1.lead_t1, t2.lead_t2, t1.lead_t1_2
  FROM cte1 t1
  JOIN cte2 t2
  ON t1.t1 = t2.t2
)
SELECT * FROM cte3;
Run Code Online (Sandbox Code Playgroud)

结果(空格的意思NULL- 在小提琴上更清楚):

s   t1  lag_t1_2    lag_t1  lag_t2  l1  l2  lead_t1     lead_t2     lead_t1_2
1   2019-01-01 00:00:00             4.60    5.07    5.07    4.60        
2   2019-01-01 01:00:00         5.07        4.60    4.60        5.07    
3   2019-01-01 02:00:00     5.07    4.60                    4.60    4.7
4   2019-01-01 03:00:00     4.60        4.7             4.7         4.20
5   2019-01-01 04:00:00             4.20    4.7     4.7     4.20        
6   2019-01-01 05:00:00         4.7         4.20    4.20        4.7     4.98
7   2019-01-01 06:00:00     4.7     4.20    4.98            4.98    4.20    4.50
8   2019-01-01 07:00:00     4.20        4.50    4.98    4.98    4.50        
9   2019-01-01 08:00:00         4.98        4.50    4.50        4.98 
Run Code Online (Sandbox Code Playgroud)

LAG()请注意和LEAD()窗口函数 ( )的使用documentation。我在同一张桌子上使用过它们,但排序不同。

这和使用该OFFSET选项意味着,从我原来的单列lat,我现在有 6 个额外的“生成”数据列,这些数据对于将值分配给缺失NULL值非常有用。拼图的最后一个(部分)部分如下所示(完整的 SQL 查询位于本文的底部,也在小提琴中)。

cte4 AS
(
  SELECT t1.s,
  t1.l1 AS lat,
    CASE 
      WHEN (t1.l1 IS NOT NULL) THEN t1.l1
      WHEN (t1.l1 IS NULL) AND (t1.l2) IS NULL AND (t1.lag_t1 IS NOT NULL)
        AND (t1.lag_t2 IS NOT NULL) THEN ROUND((t1.lag_t1 + t1.lag_t2)/2, 2) 
      WHEN (t1.lag_t2 IS NULL) AND (t1.l1 IS NULL) AND (t1.l2 IS NULL) 
        AND (t1.lead_t1 IS NULL) THEN ROUND((t1.lag_t1 + t1.lag_t1_2)/2, 2)
      WHEN (t1.l1 IS NULL) AND (t1.l2 IS NULL)  AND (t1.lag_t1 IS NULL)
        AND (t1.lead_t2 IS NULL) THEN ROUND((t1.lead_t1 + t1.lead_t1_2)/2, 2)
      ELSE 0
    END AS final_val
  FROM cte3 t1
)
SELECT s, lat, final_val FROM cte4;
Run Code Online (Sandbox Code Playgroud)

最后结果:

s    lat    final_val
1   5.07         5.07
2   4.60         4.60
3   NULL         4.84
4   NULL         4.45
5   4.7           4.7
6   4.20         4.20
7   NULL         4.59
8   4.98         4.98
9   4.50         4.50
Run Code Online (Sandbox Code Playgroud)

因此,您可以看到记录 7 的计算值是记录 6 和 8 的平均值,记录 3 是记录 1 和 2 的平均值,记录 4 的分配值是记录 5 和 6 的平均值。这是通过以下方式启用的使用和函数OFFSET的选项。如果你得到 3 的序列,那么你将不得不使用3 的序列,依此类推。LAG()LEAD()NULLOFFSET

我对这个解决方案不太满意 - 它涉及对NULLs 的数量进行硬编码,并且这些CASE语句将变得更加复杂和可怕。RECURSIVE CTE理想情况下需要某种解决方案,但我很乐意!

================================= 完整查询 ================= =======

WITH cte1 AS
(
  SELECT d1.s,
    d1.t AS t1, d1.lat AS l1,
    LAG(d1.lat, 2)  OVER (ORDER BY t ASC) AS lag_t1_2,
    LAG(d1.lat, 1)  OVER (ORDER BY t ASC) AS lag_t1,
    LEAD(d1.lat, 1) OVER (ORDER BY t ASC) AS lead_t1,
    LEAD(d1.lat, 2) OVER (ORDER BY t ASC) AS lead_t1_2
  FROM data d1
),
cte2 AS
(
  SELECT 
    d2.t AS t2, d2.lat AS l2, 
    LAG(d2.lat, 1) OVER(ORDER BY t DESC) AS lag_t2,
    LEAD(d2.lat, 1) OVER(ORDER BY t DESC) AS lead_t2
  FROM data d2
),
cte3 AS
(
  SELECT t1.s,
    t1.t1,  t1.lag_t1_2, t1.lag_t1, t2.lag_t2, t1.l1, t2.l2, 
    t1.lead_t1, t2.lead_t2, t1.lead_t1_2
  FROM cte1 t1
  JOIN cte2 t2
  ON t1.t1 = t2.t2
),
cte4 AS
(
  SELECT t1.s,
  t1.l1 AS lat,
    CASE 
      WHEN (t1.l1 IS NOT NULL) THEN t1.l1
      WHEN (t1.l1 IS NULL) AND (t1.l2) IS NULL AND (t1.lag_t1 IS NOT NULL)
        AND (t1.lag_t2 IS NOT NULL) THEN ROUND((t1.lag_t1 + t1.lag_t2)/2, 2) 
      WHEN (t1.lag_t2 IS NULL) AND (t1.l1 IS NULL) AND (t1.l2 IS NULL) 
        AND (t1.lead_t1 IS NULL) THEN ROUND((t1.lag_t1 + t1.lag_t1_2)/2, 2)
      WHEN (t1.l1 IS NULL) AND (t1.l2 IS NULL)  AND (t1.lag_t1 IS NULL)
        AND (t1.lead_t2 IS NULL) THEN ROUND((t1.lead_t1 + t1.lead_t1_2)/2, 2)
      ELSE 0
    END AS final_val,
    t1.lead_t1_2
  FROM cte3 t1
)
SELECT s, lat, final_val, lead_t1_2 FROM cte4;
Run Code Online (Sandbox Code Playgroud)