sql – Flagging non-contiguous date ranges

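Background (input)

The Global Historical Climatology Network has flagged invalid or erroneous data in its collection of weather measurements. After removing these elements, there are swaths of data that no longer have contiguous dated sections. The data resembles: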

"2007-12-01";14 -- Start of December
"2007-12-29";8
"2007-12-30";11
"2007-12-31";7
"2008-01-01";8 -- Start of January
"2008-01-02";12
"2008-01-29";0
"2008-01-31";7
"2008-02-01";4 -- Start of February
... entire month is complete ...
"2008-02-29";12
"2008-03-01";14  -- Start of March
"2008-03-02";17
"2008-03-05";17

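Problem (output)

Although missing data can be extrapolated (e.g., by averaging from other years) to provide contiguous ranges, to simplify the system I want to flag the non-contiguous segments based on whether there is a contiguous range of dates to fill the month: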

D;"2007-12-01";14 -- Start of December
D;"2007-12-29";8
D;"2007-12-30";11
D;"2007-12-31";7
D;"2008-01-01";8 -- Start of January
D;"2008-01-02";12
D;"2008-01-29";0
D;"2008-01-31";7
"2008-02-01";4 -- Start of February
... entire month is complete ...
"2008-02-29";12
D;"2008-03-01";14  -- Start of March
D;"2008-03-02";17
D;"2008-03-05";17

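Some measurements were taken as early as 1843.

Question

For all weather stations, how would you flag all the dates in months that are missing one or more days?

Source code

The code to select the data resembles: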

select
  m.id,m.taken,m.station_id,m.amount
from
  climate.measurement m

Related ideas

Generate a table filled with contiguous dates and compare them to the measurement dates.

> What is the most straightforward way to pad empty dates in sql results (on either mysql or perl end)?
> How do I group on continuous ranges
> http://msdn.microsoft.com/en-us/library/aa175780%28v=sql.80%29.aspx

Update

The problem can be recreated using the SQL in this section.

Table

The table is created as follows:

CREATE TABLE climate.calendar
(
  id serial NOT NULL,
  n character varying(2) NOT NULL,
  d date NOT NULL,
  "valid" boolean NOT NULL DEFAULT true,
  CONSTRAINT calendar_pk PRIMARY KEY (id)
)
WITH (
  OIDS=FALSE
);

Generating data

The following SQL inserts data into the table (id [int], name [varchar], date [date], valid [boolean]):

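insert into climate.calendar (n,d) 
    select 'A',(date('1982-01-1') + (n || ' days')::interval)::date cal_date
    from generate_series(0,date('2011-04-9') - date('1982-01-1') ) n;
insert into climate.calendar (n,d) 
    select 'B',(date('1982-01-1') + (n || ' days')::interval)::date cal_date
    from generate_series(0,date('2011-04-9') - date('1982-01-1') ) n;
insert into climate.calendar (n,d) 
    select 'C',(date('1982-01-1') + (n || ' days')::interval)::date cal_date
    from generate_series(0,date('2011-04-9') - date('1982-01-1') ) n;
insert into climate.calendar (n,d) 
    select 'D',(date('1982-01-1') + (n || ' days')::interval)::date cal_date
    from generate_series(0,date('2011-04-9') - date('1982-01-1') ) n;
insert into climate.calendar (n,d) 
    select 'E',(date('1982-01-1') + (n || ' days')::interval)::date cal_date
    from generate_series(0,date('2011-04-9') - date('1982-01-1') ) n;
insert into climate.calendar (n,d) 
    select 'F',(date('1982-01-1') + (n || ' days')::interval)::date cal_date
    from generate_series(0,date('2011-04-9') - date('1982-01-1') ) n;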

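The values "A" through "F" represent the names of the weather stations that took a measurement on a specific day.

Removing random rows

Delete some rows as follows: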

delete from climate.calendar where id in (select id from climate.calendar order by random() limit 5000);

Attempt #1

The following does not toggle the valid flag to false for all dates in months that are missing one or more days:

UPDATE climate.calendar
SET valid = false
WHERE date_trunc('month',d) IN (
    SELECT DISTINCT date_trunc('month',d)
    FROM climate.calendar A
    WHERE NOT EXISTS (
        SELECT 1
        FROM climate.calendar B
        WHERE A.d - 1 = B.d
   )
);

Attempt #2

The following SQL produces an empty result set:

with gen_calendar as (
    select (date('1982-01-1') + (n || ' days')::interval)::date cal_date
    from generate_series(0,date('2011-04-9') - date('1982-01-1') ) n
)
select gc.cal_date
from gen_calendar gc
left join climate.calendar c on c.d = gc.cal_date
where c.d is null;

Attempt #3

The following SQL generates all possible combinations of station names and dates:

select
  distinct( cc.n ),t.d
from
  climate.calendar cc,(
    select (date('1982-01-1') + (n || ' days')::interval)::date d
    from generate_series(0,date('2011-04-9') - date('1982-01-1') ) n
  ) t
order by
  cc.n

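However, the real data contains several hundred stations, and the dates go back to the mid-1800s, so the Cartesian product of all dates for all stations is too large. Given enough time, such an approach might work... There must be a faster way.

Attempt #4

PostgreSQL has windowing functions.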

How to select specific changes using windowing functions in postgres
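
For illustration only, here is a minimal sketch of how a windowing function might expose the gaps. It assumes the sample table climate.calendar from above (n is the station name, d is the date) and that each station has at most one row per date. The column names prev_d, gap_start and gap_end are made up for this example.

```sql
-- Sketch: list each run of missing dates per station.
-- lag(d) returns the previous date recorded for the same station.
select n,
       prev_d + 1 as gap_start,  -- first missing date (date + integer = date)
       d - 1      as gap_end     -- last missing date
from (
    select n,
           d,
           lag(d) over (partition by n order by d) as prev_d
    from climate.calendar
) s
where d - prev_d > 1;            -- date - date = integer number of days
```

Note that this only finds gaps between two existing readings. It does not notice days missing before a station's first reading or after its last one, and a gap may span several months, so the result would still need to be mapped onto months before anything is flagged.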

Thanks!

Answer

generate_series()

PostgreSQL's generate_series() function can create a view that contains a consecutive list of dates:

with calendar as (
    select ((select min(date) from test)::date + (n || ' days')::interval)::date cal_date
    from generate_series(0,(select max(date) - min(date) from test)) n
)
select cal_date
from calendar c
left join test t on t.date = c.cal_date
where t.date is null;

The expression select max(date) - min(date) from test might be off by one.
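
A quick way to check the endpoints is to try the arithmetic on a known month. The sketch below uses literal dates picked purely for illustration. Subtracting two dates in PostgreSQL yields an integer number of days, and generate_series(0, n) includes both 0 and n.

```sql
-- date - date yields an integer number of days.
select date '2008-01-31' - date '2008-01-01' as day_difference;    -- 30

-- generate_series(0, 30) yields 31 values: one for each day of January.
select count(*) as days_generated
from generate_series(0, date '2008-01-31' - date '2008-01-01') n;  -- 31
```

So, at least when the difference is fed to generate_series starting at 0, both the first and the last date appear to be covered.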

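Counting days per month

One way to identify invalid months is to create two views. The first counts the number of daily readings each station should produce per month. (Note that climate.calendar has been renamed to climate_calendar.) The second returns the actual number of daily readings each station produced per month.

Maximum days per month per station

This view returns the actual number of days in a month, per station. (For example, February will always have either 28 or 29 days.)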

create view count_max_station_calendar_days as 
with calendar as (
    select ((select min(d) from climate_calendar)::date + (n || ' days')::interval)::date cal_date
    from generate_series(0,(select max(d) - min(d) from climate_calendar)) n
)
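select n,extract(year from cal_date) yr,extract(month from cal_date) mo,count(*) num_days
-- stations is assumed to hold one row per station name (column n),
-- e.g. the result of: select distinct n from climate_calendar
from stations cross join calendar
group by n,yr,mo
order by n,yr,mo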

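Actual days per month per station

The total number of days returned will be no more than the maximum counted above. (For example, January will have 31 days or fewer.)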

create view count_actual_station_calendar_days as
select n,extract(year from d) yr,extract(month from d) mo,count(*) num_days
from climate_calendar
group by n,yr,mo
order by n,yr,mo;

Remove the ORDER BY clauses in production (they are helpful in development).

Comparing the views

Join the two views into a new view that identifies the stations and months that need to be flagged:

create view invalid_station_months as 
select m.n,m.yr,m.mo,m.num_days - a.num_days num_days_missing
from count_max_station_calendar_days m
inner join count_actual_station_calendar_days a
       on (m.n = a.n and m.yr = a.yr and m.mo = a.mo and m.num_days <> a.num_days)

n   yr    mo  num_days_missing
--
A   1982  1   1
E   2007  3   1

The column num_days_missing is not necessary, but it is useful.

These are the rows that need to be updated:

select cc.* 
from climate_calendar cc
inner join invalid_station_months im 
        on (cc.n = im.n and 
            extract(year from cc.d) = im.yr and
            extract(month from cc.d) = im.mo)
where valid = true

Updating the database

To update them, the id key is handy.

update climate_calendar
set valid = false
where id in (
    select id
    from climate_calendar cc
    inner join invalid_station_months im 
        on (cc.n = im.n and 
            extract(year from cc.d) = im.yr and
            extract(month from cc.d) = im.mo)
    where valid = true
);
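
As a possible alternative, the same flagging could probably be done in a single statement without the intermediate views. This is only a sketch, not the approach used above. It assumes the climate_calendar table, and it compares each station's count of distinct dates in a month against the full length of that calendar month. The names incomplete and month_start are made up for this example.

```sql
-- Sketch: flag every row belonging to a station-month that has fewer
-- distinct dates than the calendar month has days.
update climate_calendar cc
set valid = false
from (
    select n,
           date_trunc('month', d)::date as month_start
    from climate_calendar
    group by n, date_trunc('month', d)
    -- The last day of the month is the first day of the next month minus one day;
    -- its day-of-month number equals the number of days in the month.
    having count(distinct d) <
           extract(day from (date_trunc('month', d) + interval '1 month' - interval '1 day'))
) incomplete
where cc.n = incomplete.n
  and date_trunc('month', cc.d)::date = incomplete.month_start;
```

One difference from the view-based approach: the first and last months of the data set are measured against full calendar months here, so a partial first or last month is flagged as well.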
