我在 PostgreSQL 中有一个带有时间戳和值的表。
我想在“lat”下插入缺失的值。
“lat”下的值是基准面以上的潮汐高度。为此,可以在两个已知值之间线性插入缺失值。
在 PostgreSQL 中执行此操作的最佳方法是什么?
编辑20200825
我使用 QGIS 字段计算器以不同的方式解决了这个问题。这种方法的问题是:它需要很长时间,并且该进程在客户端运行,我想直接在服务器上运行它。
分步骤来说,我的工作流程是:
在QGIS中:
tidal_step =
-- the lat value @ the epoch, 10 minutes or 600000 miliseconds from the current epoch:
(attribute(get_feature('werkset','epoch',("epoch"+'600000')),'lat') -
-- the lat value @ the current
attribute(get_feature('werkset','epoch',"epoch"),'lat'))
/10
Run Code Online (Sandbox Code Playgroud)
对于示例图像中的前两个值,结果为: (4.95 - 5.07) /10 = -0.012
在QGIS中:
left(
right("timestamp",8) --this takes the timestamp and goes 8 charakters from the right
,1) -- this takes the string from the previous right( and goes 1 character left
Run Code Online (Sandbox Code Playgroud)
对于示例中的第一个值: 2019-01-01 00:15:15 返回: '5' 这是上次记录值过去 5 分钟的时间。
在QGIS中
CASE
WHEN "lat" = NULL
THEN
-- minutes pas the last recorded instance:
("min_past_rec" *
-- the "tidal_step" at the last recorded "lat"-value:
(attribute(get_feature('werkset','epoch',
("epoch" - --the epoch of the "lat" value to be interpolated minus:
left(right("timestamp",8),1) * 600000 -- = the amount of minutes after the last recorded instance.
+ left(right("timestamp",6),2) * 1000) -- and the amount of seconds after the last recorded instance.
),'tidal_step')) +
-- the last recorded "lat"-value
(attribute(get_feature('werkset','epoch',("epoch" - left(right("timestamp",8),1) * 600000 + left(right("timestamp",6),2) * 1000)),'lat'))
Run Code Online (Sandbox Code Playgroud)
使用示例中的数据:
2019-01-01 00:17:33:
"lat_interpolated" = "min_past_rec" * "tidal_step" + "lat" =
7*-0.012 + 4.95 = 4.866
Run Code Online (Sandbox Code Playgroud)
我应该在 PostgreSQL 中使用哪些语句/脚本来执行相同的任务?
我有一个(部分)解决方案 - 我所做的如下(请参阅此处可用的小提琴):
我用于插值的算法是
如果存在序列 1 NULL
,则取上方值和下方值的平均值。
一个 2 秒的序列NULL
,顶部分配的值是其上方两条记录的平均值,底部分配的值是其下方两条记录的平均值。
为了做到这一点,我做了以下事情:
创建一个表:
CREATE TABLE data
(
s SERIAL PRIMARY KEY,
t TIMESTAMP,
lat NUMERIC
);
Run Code Online (Sandbox Code Playgroud)
用一些示例数据填充它:
INSERT INTO data (t, lat)
VALUES
('2019-01-01 00:00:00', 5.07),
('2019-01-01 01:00:00', 4.60),
('2019-01-01 02:00:00', NULL),
('2019-01-01 03:00:00', NULL),
('2019-01-01 04:00:00', 4.7),
('2019-01-01 05:00:00', 4.20),
('2019-01-01 06:00:00', NULL),
('2019-01-01 07:00:00', 4.98),
('2019-01-01 08:00:00', 4.50);
Run Code Online (Sandbox Code Playgroud)
请注意,记录 3、4 和 7 是NULL
。
然后我运行了第一个查询:
WITH cte1 AS
(
SELECT d1.s,
d1.t AS t1, d1.lat AS l1,
LAG(d1.lat, 2) OVER (ORDER BY t ASC) AS lag_t1_2,
LAG(d1.lat, 1) OVER (ORDER BY t ASC) AS lag_t1,
LEAD(d1.lat, 1) OVER (ORDER BY t ASC) AS lead_t1,
LEAD(d1.lat, 2) OVER (ORDER BY t ASC) AS lead_t1_2
FROM data d1
),
cte2 AS
(
SELECT
d2.t AS t2, d2.lat AS l2,
LAG(d2.lat, 1) OVER(ORDER BY t DESC) AS lag_t2,
LEAD(d2.lat, 1) OVER(ORDER BY t DESC) AS lead_t2
FROM data d2
),
cte3 AS
(
SELECT t1.s,
t1.t1, t1.lag_t1_2, t1.lag_t1, t2.lag_t2, t1.l1, t2.l2,
t1.lead_t1, t2.lead_t2, t1.lead_t1_2
FROM cte1 t1
JOIN cte2 t2
ON t1.t1 = t2.t2
)
SELECT * FROM cte3;
Run Code Online (Sandbox Code Playgroud)
结果(空格的意思NULL
- 在小提琴上更清楚):
s t1 lag_t1_2 lag_t1 lag_t2 l1 l2 lead_t1 lead_t2 lead_t1_2
1 2019-01-01 00:00:00 4.60 5.07 5.07 4.60
2 2019-01-01 01:00:00 5.07 4.60 4.60 5.07
3 2019-01-01 02:00:00 5.07 4.60 4.60 4.7
4 2019-01-01 03:00:00 4.60 4.7 4.7 4.20
5 2019-01-01 04:00:00 4.20 4.7 4.7 4.20
6 2019-01-01 05:00:00 4.7 4.20 4.20 4.7 4.98
7 2019-01-01 06:00:00 4.7 4.20 4.98 4.98 4.20 4.50
8 2019-01-01 07:00:00 4.20 4.50 4.98 4.98 4.50
9 2019-01-01 08:00:00 4.98 4.50 4.50 4.98
Run Code Online (Sandbox Code Playgroud)
LAG()
请注意和LEAD()
窗口函数 ( )的使用documentation
。我在同一张桌子上使用过它们,但排序不同。
这和使用该OFFSET
选项意味着,从我原来的单列lat
,我现在有 6 个额外的“生成”数据列,这些数据对于将值分配给缺失NULL
值非常有用。拼图的最后一个(部分)部分如下所示(完整的 SQL 查询位于本文的底部,也在小提琴中)。
cte4 AS
(
SELECT t1.s,
t1.l1 AS lat,
CASE
WHEN (t1.l1 IS NOT NULL) THEN t1.l1
WHEN (t1.l1 IS NULL) AND (t1.l2) IS NULL AND (t1.lag_t1 IS NOT NULL)
AND (t1.lag_t2 IS NOT NULL) THEN ROUND((t1.lag_t1 + t1.lag_t2)/2, 2)
WHEN (t1.lag_t2 IS NULL) AND (t1.l1 IS NULL) AND (t1.l2 IS NULL)
AND (t1.lead_t1 IS NULL) THEN ROUND((t1.lag_t1 + t1.lag_t1_2)/2, 2)
WHEN (t1.l1 IS NULL) AND (t1.l2 IS NULL) AND (t1.lag_t1 IS NULL)
AND (t1.lead_t2 IS NULL) THEN ROUND((t1.lead_t1 + t1.lead_t1_2)/2, 2)
ELSE 0
END AS final_val
FROM cte3 t1
)
SELECT s, lat, final_val FROM cte4;
Run Code Online (Sandbox Code Playgroud)
最后结果:
s lat final_val
1 5.07 5.07
2 4.60 4.60
3 NULL 4.84
4 NULL 4.45
5 4.7 4.7
6 4.20 4.20
7 NULL 4.59
8 4.98 4.98
9 4.50 4.50
Run Code Online (Sandbox Code Playgroud)
因此,您可以看到记录 7 的计算值是记录 6 和 8 的平均值,记录 3 是记录 1 和 2 的平均值,记录 4 的分配值是记录 5 和 6 的平均值。这是通过以下方式启用的使用和函数OFFSET
的选项。如果你得到 3 的序列,那么你将不得不使用3 的序列,依此类推。LAG()
LEAD()
NULL
OFFSET
我对这个解决方案不太满意 - 它涉及对NULL
s 的数量进行硬编码,并且这些CASE
语句将变得更加复杂和可怕。RECURSIVE CTE
理想情况下需要某种解决方案,但我很乐意!
================================= 完整查询 ================= =======
WITH cte1 AS
(
SELECT d1.s,
d1.t AS t1, d1.lat AS l1,
LAG(d1.lat, 2) OVER (ORDER BY t ASC) AS lag_t1_2,
LAG(d1.lat, 1) OVER (ORDER BY t ASC) AS lag_t1,
LEAD(d1.lat, 1) OVER (ORDER BY t ASC) AS lead_t1,
LEAD(d1.lat, 2) OVER (ORDER BY t ASC) AS lead_t1_2
FROM data d1
),
cte2 AS
(
SELECT
d2.t AS t2, d2.lat AS l2,
LAG(d2.lat, 1) OVER(ORDER BY t DESC) AS lag_t2,
LEAD(d2.lat, 1) OVER(ORDER BY t DESC) AS lead_t2
FROM data d2
),
cte3 AS
(
SELECT t1.s,
t1.t1, t1.lag_t1_2, t1.lag_t1, t2.lag_t2, t1.l1, t2.l2,
t1.lead_t1, t2.lead_t2, t1.lead_t1_2
FROM cte1 t1
JOIN cte2 t2
ON t1.t1 = t2.t2
),
cte4 AS
(
SELECT t1.s,
t1.l1 AS lat,
CASE
WHEN (t1.l1 IS NOT NULL) THEN t1.l1
WHEN (t1.l1 IS NULL) AND (t1.l2) IS NULL AND (t1.lag_t1 IS NOT NULL)
AND (t1.lag_t2 IS NOT NULL) THEN ROUND((t1.lag_t1 + t1.lag_t2)/2, 2)
WHEN (t1.lag_t2 IS NULL) AND (t1.l1 IS NULL) AND (t1.l2 IS NULL)
AND (t1.lead_t1 IS NULL) THEN ROUND((t1.lag_t1 + t1.lag_t1_2)/2, 2)
WHEN (t1.l1 IS NULL) AND (t1.l2 IS NULL) AND (t1.lag_t1 IS NULL)
AND (t1.lead_t2 IS NULL) THEN ROUND((t1.lead_t1 + t1.lead_t1_2)/2, 2)
ELSE 0
END AS final_val,
t1.lead_t1_2
FROM cte3 t1
)
SELECT s, lat, final_val, lead_t1_2 FROM cte4;
Run Code Online (Sandbox Code Playgroud)