The Global Historical Climatology Network has flagged invalid or erroneous data in its collection of weather measurements. After removing these elements, large stretches of data no longer have contiguous dates. The data resembles:
"2007-12-01";14 -- Start of December
"2007-12-29";8
"2007-12-30";11
"2007-12-31";7
"2008-01-01";8 -- Start of January
"2008-01-02";12
"2008-01-29";0
"2008-01-31";7
"2008-02-01";4 -- Start of February
... entire month is complete ...
"2008-02-29";12
"2008-03-01";14 -- Start of March
"2008-03-02";17
"2008-03-05";17
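Problem (Output)
Although it is possible to extrapolate missing data (for example, by averaging from other years) to provide contiguous ranges, to simplify the system I want to flag the non-contiguous segments based on whether a contiguous range of dates fills the month: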
D;"2007-12-01";14 -- Start of December
D;"2007-12-29";8
D;"2007-12-30";11
D;"2007-12-31";7
D;"2008-01-01";8 -- Start of January
D;"2008-01-02";12
D;"2008-01-29";0
D;"2008-01-31";7
"2008-02-01";4 -- Start of February
... entire month is complete ...
"2008-02-29";12
D;"2008-03-01";14 -- Start of March
D;"2008-03-02";17
D;"2008-03-05";17
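Some measurements were taken as far back as 1843.
Question
For all weather stations, how would you flag every day in a month that is missing one or more days?
Source Code
The code to select the data resembles: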
select m.id, m.taken, m.station_id, m.amount from climate.measurement m
Related Ideas
Generate a table filled with contiguous dates and compare them to the measurement dates.
> What is the most straightforward way to pad empty dates in sql results (on either mysql or perl end)?
> How do I group on continuous ranges
> http://msdn.microsoft.com/en-us/library/aa175780%28v=sql.80%29.aspx
Update
The problem can be recreated using the SQL in this section.
Table
The table is created as follows:
CREATE TABLE climate.calendar (
  id serial NOT NULL,
  n character varying(2) NOT NULL,
  d date NOT NULL,
  "valid" boolean NOT NULL DEFAULT true,
  CONSTRAINT calendar_pk PRIMARY KEY (id)
) WITH (
  OIDS=FALSE
);
Generate Data
The following SQL inserts data into the table (id [int], name [varchar], date [date], valid [boolean]):
insert into climate.calendar (n, d)
  select 'A', (date('1982-01-1') + (n || ' days')::interval)::date cal_date
  from generate_series(0, date('2011-04-9') - date('1982-01-1')) n;
-- Repeat the same statement with 'B', 'C', 'D', 'E' and 'F' in place of 'A'.
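The values 'A' through 'F' represent the names of the weather stations that took a measurement on a particular day.
Remove some rows at random as follows: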
delete from climate.calendar where id in (select id from climate.calendar order by random() limit 5000);
Attempt #1
The following does not toggle the valid flag to false for all the days in a month that is missing one or more days:
UPDATE climate.calendar
SET valid = false
WHERE date_trunc('month', d) IN (
  SELECT DISTINCT date_trunc('month', d)
  FROM climate.calendar A
  WHERE NOT EXISTS (
    SELECT 1
    FROM climate.calendar B
    WHERE A.d - 1 = B.d
  )
);
Attempt #2
with gen_calendar as (
  select (date('1982-01-1') + (n || ' days')::interval)::date cal_date
  from generate_series(0, date('2011-04-9') - date('1982-01-1')) n
)
select gc.cal_date
from gen_calendar gc
left join climate.calendar c on c.d = gc.cal_date
where c.d is null;
Attempt #3
select distinct( cc.n ), t.d
from climate.calendar cc, (
  select (date('1982-01-1') + (n || ' days')::interval)::date d
  from generate_series(0, date('2011-04-9') - date('1982-01-1')) n
) t
order by cc.n
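However, the real data has several hundred stations and dates going back to the mid-1800s, so the Cartesian product of all dates for all stations is too large. Given enough time, such an approach might work, but there must be a faster way.
Attempt #4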
How to select specific changes using windowing functions in postgres
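Thank you!
Solution
PostgreSQL's generate_series() function can build a consecutive list of dates, which can then be joined against the data to find the dates that are missing: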
with calendar as (
  select ((select min(date) from test)::date + (n || ' days')::interval)::date cal_date
  from generate_series(0, (select max(date) - min(date) from test)) n
)
select cal_date
from calendar c
left join test t on t.date = c.cal_date
where t.date is null;
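The expression select max(date) - min(date) from test might be off by one.
Count Days Per Month
One way to identify invalid months is to create two views. The first counts the number of daily readings each station should produce per month. (Note that climate.calendar has been renamed to climate_calendar.) The second returns the number of daily readings each station actually produced per month.
Maximum Days Per Month Per Station
This view returns the number of calendar days in each month, per station, which is the most daily readings a station can have for that month. (For example, February will always have either 28 or 29 days.)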
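Whether the range is off by one depends on how the difference is used. As a quick check (a sketch using literal dates chosen only for illustration), subtracting two date values in PostgreSQL gives the number of days between them, and generate_series(0, n) includes both endpoints, so it yields n + 1 rows:

```sql
-- Date subtraction returns an integer number of days: 30 for January 2008.
select date '2008-01-31' - date '2008-01-01' as day_difference;

-- generate_series is inclusive at both ends, so 0..30 yields 31 rows,
-- one for each day of the month.
select count(*) as days_generated
from generate_series(0, date '2008-01-31' - date '2008-01-01') as n;
```

If the column holds timestamps rather than dates, the subtraction returns an interval instead of an integer, so the bounds passed to generate_series would need adjusting.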
create view count_max_station_calendar_days as
with calendar as (
  select ((select min(d) from climate_calendar)::date + (n || ' days')::interval)::date cal_date
  from generate_series(0, (select max(d) - min(d) from climate_calendar)) n
)
-- Note: stations is not defined in this answer; it is assumed to list each station name once in column n.
select n, extract(year from cal_date) yr, extract(month from cal_date) mo, count(*) num_days
from stations cross join calendar
group by n, yr, mo
order by n, yr, mo;
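The view reads from a relation named stations, which this answer does not define. A minimal sketch, assuming it only needs one row per station name in a column n, is to derive it from the data itself:

```sql
-- Assumption: "stations" is not defined in the original answer; this derives
-- one row per station name from the calendar table so the cross join works.
create view stations as
  select distinct n
  from climate_calendar;
```

It would have to exist before count_max_station_calendar_days is created. With the real data, a dedicated station table would likely be a better source than a distinct scan of the measurements.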
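Actual Days Per Month Per Station
The number of days returned here will never exceed the tally from the previous view. (For example, January will have 31 days or fewer.)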
create view count_actual_station_calendar_days as
select n, extract(year from d) yr, extract(month from d) mo, count(*) num_days
from climate_calendar
group by n, yr, mo
order by n, yr, mo;
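Remove the ORDER BY clauses in production (they are only helpful during development).
Compare Views
Join the two views into a new view that identifies the stations and months that need to be flagged: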
create view invalid_station_months as
select m.n, m.yr, m.mo, m.num_days - a.num_days num_days_missing
from count_max_station_calendar_days m
inner join count_actual_station_calendar_days a
  on (m.n = a.n and m.yr = a.yr and m.mo = a.mo and m.num_days <> a.num_days);

n   yr    mo  num_days_missing
--------------------------------
A   1982  1   1
E   2007  3   1
The column num_days_missing is not necessary, but it is useful.
These are the rows that need to be updated:
select cc.*
from climate_calendar cc
inner join invalid_station_months im
  on (cc.n = im.n and extract(year from cc.d) = im.yr and extract(month from cc.d) = im.mo)
where valid = true
Update the Database
To update them, the id key comes in handy.
update climate_calendar
set valid = false
where id in (
  select id
  from climate_calendar cc
  inner join invalid_station_months im
    on (cc.n = im.n and extract(year from cc.d) = im.yr and extract(month from cc.d) = im.mo)
  where valid = true
);
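For comparison, the same incomplete months could probably be found in one pass, without generating a calendar, by comparing each station's row count for a month against that month's length. This is a sketch rather than part of the original answer; it assumes climate_calendar holds at most one row per station per day, and the aliases month_start, actual_days and expected_days are made up for illustration:

```sql
select
  n,
  month_start,
  count(*) as actual_days,
  -- Last day of the month = first day of the next month minus one day.
  extract(day from month_start + interval '1 month' - interval '1 day')::int as expected_days
from (
  -- Reduce every reading to its station and the first day of its month.
  select n, date_trunc('month', d)::date as month_start
  from climate_calendar
) m
group by n, month_start
-- Keep only the station-months with fewer rows than the month has days.
having count(*) < extract(day from month_start + interval '1 month' - interval '1 day');
```

Unlike the calendar-based views, this treats the first and last months of the data set as full months, so a partial month at either end of the range is reported as incomplete.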