Recently I had a table such as this:
ID
|
BEGIN_DATE
|
2600
|
01-JAN-08
|
2600
|
01-FEB-08
|
2600
|
01-MAR-08
|
2600
|
01-JUL-13
|
3318
|
01-APR-08
|
3318
|
01-JUL-13
|
…
|
…
|
which signifies when a particular ID is active on a monthly basis. When it isn’t, it just does not show up. I thought it would be more concise and useful for some types of analytics in this form:
ID
|
SPAN_ID
|
SPAN_BEGI
|
SPAN_END
|
SPAN_DURATION
|
2600
|
1
|
01-JAN-08
|
01-APR-08
|
91
|
2600
|
2
|
01-JUL-13
|
01-AUG-13
|
31
|
3318
|
1
|
01-APR-08
|
01-MAY-08
|
30
|
3318
|
2
|
01-JUL-13
|
01-AUG-13
|
31
|
…
|
…
|
…
|
…
|
…
|
- It found gaps as opposed to spans. I figure you could add the max and min dates to the table and shift the dates to fix this, however I didn’t want to add an additional step (granted there is probably a simple fix for this that I didn’t see)
- The code appeared to be creating a list of all dates in the spans. This seemed to be an overkill which I expected to hurt the computational speed.
So as I usually do, instead of trying to fix something I did not quite understand, I decided to simply start from scratch. My method starts with a unique triplet of ID, Begin Date, End Date which are technically spans themselves. The original data was not actually unique nor did it have an end date, however it was implied it was the first of the following month. Note that there will be an equality between Begin Date and End Date for continuous spans (just my particular convention).
- Lag End Date and lead Begin Date since intermediate
- Filter out Lag End Date = Begin Date and Lead Begin Date = End Date since these are the intermediate dates we don’t care about.
- Shift the new End Date up (lead) to match the Begin Date record (this is kind of like pivoting)
- Filter out the End Date records
with /* Lead/Lag Begin/End dates */ t as ( select id, begin_date, lag(end_date, 1, to_date('1800-01-15', 'yyyy-mm-dd')) over ( partition by id order by begin_date) as lag_end, end_date, lead(begin_date, 1, to_date('1800-01-15', 'yyyy-mm-dd')) over ( partition by id order by begin_date) as lead_begin from preprocessed_table ), /* Remove all intermediate dates and pulls end_date up */ s as ( select id, begin_date, end_date, lead(end_date) over ( partition by id order by begin_date) new_end_date, (case when end_date = lead_begin then 'begin' when begin_date = lag_end then 'end' else 'single' end) as b from t where begin_date != lag_end or end_date != lead_begin) /* Additional calculations and fix for single month enrollments */ select id, row_number() over ( partition by id order by begin_date) as span_id, begin_date as span_begin, decode(b, 'single', end_date, new_end_date) as span_end, decode(b, 'single', end_date, new_end_date) - begin_date as span_duration from s where b in ('begin', 'single') ;
Voila you now have a simple fast script to transform your data into a list of spans. As I mentioned, this was done on monthly data, however it should work on any interval, as long as the begin and end match up. Additionally it would work with multiple ID variables simply by adding those into the partition function.
Gaps
Since I did mention gaps, I should give you that code also. Since this basically the inverser, all that is needed is to shift the dates, relabel, and filter out the odd (or actually even) records. Since there will always be one less record denoted by a NULL this needs to be filtered out also.
with s as ( select master_member_id, span_end as gap_begin, lead(span_begin) over ( partition by master_member_id order by span_begin) as gap_end, row_number() over ( partition by master_member_id order by span_begin) as r from enrollment where master_member_id in ('100135', '100982') ) select master_member_id, gap_begin, gap_end from s where mod(r, 2) = 1 and gap_end is not null;