Recently I had a table such as this:

ID
 BEGIN_DATE
2600    
 01-JAN-08
2600    
 01-FEB-08
2600    
 01-MAR-08
2600    
 01-JUL-13
3318    
 01-APR-08
3318    
 01-JUL-13

which signifies when a particular ID is active on a monthly basis.  When it isn’t, it just does not show up.  I thought it would be more concise and useful for some types of analytics in this form:

ID
    SPAN_ID
 SPAN_BEGI
 SPAN_END 
 SPAN_DURATION
2600
          1
 01-JAN-08
 01-APR-08
            91
2600
          2
 01-JUL-13
 01-AUG-13
            31
3318
          1
 01-APR-08
 01-MAY-08
            30
3318
          2
 01-JUL-13
 01-AUG-13
            31
which for the most part looks similarly, however collapses continuous months into a span.  Although this particular subset shows the contrary, most ID spans are continuous over several months yielding less records (collapsing sparse dates would not be very useful). Now the guys over at Ask Tom did answer this question with a rather nice solution, however I had two issues:
  1. It found gaps as opposed to spans.  I figure you could add the max and min dates to the table and shift the dates to fix this, however I didn’t want to add an additional step (granted there is probably a simple fix for this that I didn’t see)
  2. The code appeared to be creating a list of all dates in the spans.  This seemed to be an overkill which I expected to hurt the computational speed.

So as I usually do, instead of trying to fix something I did not quite understand, I decided to simply start from scratch.  My method starts with a unique triplet of ID, Begin Date, End Date which are technically spans themselves.  The original data was not actually unique nor did it have an end date, however it was implied it was the first of the following month. Note that there will be an equality between Begin Date and End Date for continuous spans (just my particular convention).

Once you have that, this script takes over:
  1. Lag End Date and lead Begin Date since intermediate
  2. Filter out Lag End Date = Begin Date and Lead Begin Date = End Date since these are the intermediate dates we don’t care about.
  3. Shift the new End Date up (lead) to match the Begin Date record (this is kind of like pivoting)
  4. Filter out the End Date records

Now that would be it however there was one problem: I was not catching the single record spans (i.e. last 3 records in the above tables).  I fixed these somewhat separately by finding them and changing their End Date (see the last decode lines).  
 with
    /* Lead/Lag Begin/End dates */
 t as (
 select
     id,
     begin_date,
     lag(end_date, 1, to_date('1800-01-15', 'yyyy-mm-dd')) over (
  partition by id order by begin_date) as lag_end,
     end_date,
     lead(begin_date, 1, to_date('1800-01-15', 'yyyy-mm-dd')) over (
  partition by id order by begin_date) as lead_begin 
 from
     preprocessed_table
     ),
    /* Remove all intermediate dates and pulls end_date up */
 s as (
 select
     id,
     begin_date,
     end_date,
     lead(end_date) over (
  partition by id order by begin_date) new_end_date,
     (case
  when
  end_date = lead_begin
  then 'begin'
  when
  begin_date = lag_end
  then 'end'
  else 'single'
  end) as b
 from
     t
 where
     begin_date != lag_end or end_date != lead_begin)
    /* Additional calculations and fix for single month enrollments */
    select
 id,
       row_number() over (
     partition by id order by begin_date) as span_id,
 begin_date as span_begin,
 decode(b, 'single',
     end_date, new_end_date) as span_end,
 decode(b, 'single',
     end_date, new_end_date) - begin_date as span_duration
    from
 s
    where
 b in ('begin', 'single')
 ;

Voila you now have a simple fast script to transform your data into a list of spans.  As I mentioned, this was done on monthly data, however it should work on any interval, as long as the begin and end match up.  Additionally it would work with multiple ID variables simply by adding those into the partition function.

Gaps

Since I did mention gaps, I should give you that code also.  Since this basically the inverser, all that is needed is to shift the dates, relabel, and filter out the odd (or actually even) records.  Since there will always be one less record denoted by a NULL this needs to be filtered out also.

     with s as (
 select
     master_member_id,
     span_end as gap_begin,
     lead(span_begin) over (
  partition by master_member_id order by span_begin) as gap_end,
     row_number() over (
  partition by master_member_id order by span_begin) as r
 from
     enrollment
 where
     master_member_id in ('100135', '100982')
     )
    select
 master_member_id,
 gap_begin,
 gap_end
    from
 s
    where
 mod(r, 2) = 1 and gap_end is not null;

Final Word (of Caution)

Although the initial data set appears to be a list of date spans, this code does not actually view it as such, in that if two of the spans overlapped problems would arise.  For instance if an ID has a begin and end of 1/1/2012 – 2/1/2012 and another one with 1/15 – 2/15, the shifting will not match the dates and treat these as separate spans as oppose to 1/1 – 2/15.  I have not run into a need for this yet, however I would expect some novel pre-processing should easily fix this (note Tom’s code should not have this problem).

Leave a comment