Dav*_*vis 8 sql postgresql date-range gaps-and-islands contiguous
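The Global Historical Climatology Network has flagged invalid or erroneous data in its collection of weather measurements. After removing these elements, there are swaths of data that no longer have contiguous date sections. The data resembles: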
"2007-12-01";14 -- Start of December
"2007-12-29";8
"2007-12-30";11
"2007-12-31";7
"2008-01-01";8 -- Start of January
"2008-01-02";12
"2008-01-29";0
"2008-01-31";7
"2008-02-01";4 -- Start of February
... entire month is complete ...
"2008-02-29";12
"2008-03-01";14 -- Start of March
"2008-03-02";17
"2008-03-05";17
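Although the missing data could be extrapolated (for example, by averaging values from other years) to provide contiguous ranges, to keep the system simple I want to flag the non-contiguous segments according to whether a contiguous range of dates fills the month: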
D;"2007-12-01";14 -- Start of December
D;"2007-12-29";8
D;"2007-12-30";11
D;"2007-12-31";7
D;"2008-01-01";8 -- Start of January
D;"2008-01-02";12
D;"2008-01-29";0
D;"2008-01-31";7
"2008-02-01";4 -- Start of February
... entire month is complete ...
"2008-02-29";12
D;"2008-03-01";14 -- Start of March
D;"2008-03-02";17
D;"2008-03-05";17
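Some measurements were taken as far back as 1843.
For all weather stations, how would you mark all the days in months that are missing one or more days?
The code to select the data resembles: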
select
m.id,
m.taken,
m.station_id,
m.amount
from
climate.measurement m;
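One idea: generate a table filled with contiguous dates and compare them with the dates of the measured data.
The problem can be recreated using the SQL in this section.
The table is created as follows: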
CREATE TABLE climate.calendar
(
id serial NOT NULL,
n character varying(2) NOT NULL,
d date NOT NULL,
"valid" boolean NOT NULL DEFAULT true,
CONSTRAINT calendar_pk PRIMARY KEY (id)
)
WITH (
OIDS=FALSE
);
The following SQL inserts data into the table (columns: id [int], n for name [varchar], d for date [date], valid [boolean]):
insert into climate.calendar (n, d)
select 'A', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n;
insert into climate.calendar (n, d)
select 'B', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n;
insert into climate.calendar (n, d)
select 'C', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n;
insert into climate.calendar (n, d)
select 'D', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n;
insert into climate.calendar (n, d)
select 'E', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n;
insert into climate.calendar (n, d)
select 'F', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n;
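The values 'A' through 'F' represent the names of the weather stations that took a measurement on a given day.
Remove some rows at random as follows: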
delete from climate.calendar where id in (select id from climate.calendar order by random() limit 5000);
The following does not toggle the valid flag to false for all days in every month that is missing one or more days:
UPDATE climate.calendar
SET valid = false
WHERE date_trunc('month', d) IN (
SELECT DISTINCT date_trunc('month', d)
FROM climate.calendar A
WHERE NOT EXISTS (
SELECT 1
FROM climate.calendar B
WHERE A.d - 1 = B.d
)
);
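One plausible reason the UPDATE above misses most months: the NOT EXISTS subquery is not correlated on the station name, so a date counts as "preceded" when any station has the previous day. With six stations and random deletions, nearly every date survives for at least one of them. The diagnostic below is only a sketch; the added b.n = a.n condition is an assumption about the intent, not part of the original query.

```sql
-- Sketch: per station, list each reading whose previous calendar day
-- is absent for that same station.
SELECT a.n, a.d
FROM climate.calendar a
WHERE NOT EXISTS (
  SELECT 1
  FROM climate.calendar b
  WHERE b.n = a.n          -- assumption: gaps should be judged per station
    AND b.d = a.d - 1      -- date - integer yields a date
)
ORDER BY a.n, a.d;
```

Even with the correlation, this reports the day on which readings resume, which can fall in the month after the one that is actually missing days (a missing January 31 shows up as February 1). Each station's first reading is also listed, since nothing precedes it.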
The following SQL produces an empty result set:
with gen_calendar as (
select (date('1982-01-1') + (n || ' days')::interval)::date cal_date
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
)
select gc.cal_date
from gen_calendar gc
left join climate.calendar c on c.d = gc.cal_date
where c.d is null;
The following SQL generates all possible combinations of station names and dates:
select
distinct( cc.n ), t.d
from
climate.calendar cc,
(
select (date('1982-01-1') + (n || ' days')::interval)::date d
from generate_series(0, date('2011-04-9') - date('1982-01-1') ) n
) t
order by
cc.n
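However, the real data has several hundred stations and dates going back to the mid-1800s, so the Cartesian product of all dates for all stations is too large. Such an approach might work given enough time, but there must be a faster way.
PostgreSQL has windowing functions.
Thank you!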
PostgreSQL's generate_series() function can create a view that contains a list of consecutive dates:
with calendar as (
select ((select min(date) from test)::date + (n || ' days')::interval)::date cal_date
from generate_series(0, (select max(date) - min(date) from test)) n
)
select cal_date
from calendar c
left join test t on t.date = c.cal_date
where t.date is null;
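The expression select max(date) - min(date) from test might be off by one, so check the endpoints. (generate_series() includes both its start and stop values.)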
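A quick way to check the endpoints is to run the same arithmetic on a small, known range. The dates below are chosen only for illustration. Because generate_series() is inclusive at both ends, a span of 3 days produces 4 rows.

```sql
-- Subtracting two dates yields an integer number of days.
select date '2008-03-01' - date '2008-02-27' as day_span;   -- 3

-- Offsets 0, 1, 2, 3 give four dates, both endpoints included.
select date '2008-02-27' + n as cal_date
from generate_series(0, date '2008-03-01' - date '2008-02-27') as n
order by cal_date;
-- 2008-02-27, 2008-02-28, 2008-02-29, 2008-03-01
```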
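One way to identify the invalid months is to create two views. The first counts the number of daily readings each station should produce in each month. (Note that climate.calendar is renamed to climate_calendar here.) The second returns the actual number of daily readings each station produced in each month.
The first view returns the actual number of days in each month, per station. (For example, February will always have either 28 or 29 days.)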
create view count_max_station_calendar_days as
with calendar as (
select ((select min(d) from climate_calendar)::date + (n || ' days')::interval)::date cal_date
from generate_series(0, (select max(d) - min(d) from climate_calendar)) n
)
select n, extract(year from cal_date) yr, extract(month from cal_date) mo, count(*) num_days
from stations cross join calendar
group by n, yr, mo
order by n, yr, mo
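The second view returns the number of readings actually present; the total number of days returned will never exceed the expected tallies. (For example, January will always have 31 days or fewer.)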
create view count_actual_station_calendar_days as
select n, extract(year from d) yr, extract(month from d) mo, count(*) num_days
from climate_calendar
group by n, yr, mo
order by n, yr, mo;
Drop the ORDER BY clauses in production (they are helpful during development).
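As a possible alternative to counting calendar rows, the expected number of days in a month can be computed directly from the first day of the month. This is a sketch with literal dates chosen for illustration; month_start is a hypothetical column name.

```sql
-- First day of the month, plus one month, minus one day, is the last day
-- of that month; its day-of-month is the month's length.
select month_start,
       extract(day from (month_start + interval '1 month' - interval '1 day')) as days_in_month
from (values (date '2007-02-01'),
             (date '2008-02-01'),
             (date '2008-01-01')) as t(month_start);
-- 2007-02-01 -> 28, 2008-02-01 -> 29, 2008-01-01 -> 31
```

Unlike the calendar-based view, this arithmetic always expects a full month, so a data set that begins or ends mid-month would report its first or last month as short.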
Join the two views into a new view that identifies the stations and months that need to be flagged:
create view invalid_station_months as
select m.n, m.yr, m.mo, m.num_days - a.num_days num_days_missing
from count_max_station_calendar_days m
inner join count_actual_station_calendar_days a
on (m.n = a.n and m.yr = a.yr and m.mo = a.mo and m.num_days <> a.num_days);

n    yr     mo    num_days_missing
----------------------------------
A    1982   1     1
E    2007   3     1
The column num_days_missing is not necessary, but it is useful.
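To see where the missing days fall, and not only how many there are, a window function can compare each reading with the previous one for the same station. This is a sketch assuming PostgreSQL 8.4 or later, where lag() is available; gap_start, gap_end, and days_missing are illustrative names.

```sql
-- For each station, pair every reading with the previous reading's date
-- and keep the pairs that are more than one day apart.
select n,
       prev_d + 1     as gap_start,     -- first missing day
       d - 1          as gap_end,       -- last missing day
       d - prev_d - 1 as days_missing
from (
  select n, d,
         lag(d) over (partition by n order by d) as prev_d
  from climate_calendar
) s
where d - prev_d > 1    -- the first row per station has a null prev_d and is skipped
order by n, gap_start;
```

This reads each station's rows once in date order, so it avoids building every station and date combination. It cannot see days missing before a station's first reading or after its last.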
These are the rows that need to be updated:
select cc.*
from climate_calendar cc
inner join invalid_station_months im
on (cc.n = im.n and
extract(year from cc.d) = im.yr and
extract(month from cc.d) = im.mo)
where valid = true
To update them, the id key is convenient:
update climate_calendar
set valid = false
where id in (
select id
from climate_calendar cc
inner join invalid_station_months im
on (cc.n = im.n and
extract(year from cc.d) = im.yr and
extract(month from cc.d) = im.mo)
where valid = true
);
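If the views are not needed for anything else, the same flagging could be attempted in one statement. The sketch below is an alternative to the view-based approach, not a restatement of it. It assumes at most one row per station per day and treats any month with fewer rows than calendar days as incomplete; incomplete and month_start are hypothetical names.

```sql
update climate_calendar cc
set valid = false
from (
  -- Station-months that hold fewer readings than the month has days.
  select n, date_trunc('month', d)::date as month_start
  from climate_calendar
  group by n, date_trunc('month', d)::date
  having count(*) < extract(day from (date_trunc('month', d)::date
                                      + interval '1 month' - interval '1 day'))
) incomplete
where cc.n = incomplete.n
  and date_trunc('month', cc.d)::date = incomplete.month_start
  and cc.valid;   -- skip rows that are already flagged
```

Because the expected count comes from the calendar month itself, a partial month at the very start or end of a station's record (for example, the test data stopping on 2011-04-09) is flagged as well. Whether that is desirable depends on how those edge months should be treated.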