[R] Processing key_column, begin_date, end_date in R

Thu Feb 26 15:30:56 CET 2015

Here is another way. Have not tested for large scale efficiency, but if you convert dta to a data.table that might improve things.

library(dplyr)
dta <- read.csv( text=
"key_column,begin_date,end_date
123456,2013-01-01,2014-01-01
123456,2013-07-01,2014-07-01
789102,2012-03-01,2014-03-01
789102,2015-02-01,2016-02-01
789102,2015-02-06,2016-02-06
789102,2015-02-28,2015-03-31
789102,2015-04-30,2015-05-31
", as.is=TRUE)
( dta
%>% mutate( begin_date = as.Date( begin_date ),
end_date = as.Date( end_date ) )
%>% arrange( key_column, begin_date )
) -> dta

mkgp <- function( begin_date, cend ) {
  ix <- c( TRUE, cend[ -length( begin_date ) ] < begin_date[ -1 ] )
  cumsum( ix )
}

result <- ( dta
          %>% group_by( key_column )
          %>% mutate( cend = as.Date( cummax( as.numeric( end_date ) )
                                    , origin="1970-01-01" )
                      , gp = mkgp( begin_date, cend )
                      )
          %>% ungroup
          %>% group_by( key_column, gp )
          %>% summarise(  begin_date = begin_date[ 1 ]
                        , end_date = cend[ length( cend ) ]
                        )
          %>% ungroup
          %>% select( -gp )
          %>% as.data.frame
          )
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
--------------------------------------------------------------------------- 
Sent from my phone. Please excuse my brevity.

On February 25, 2015 1:18:58 PM PST, Matt Gross <grossm at gmail.com> wrote:
>Hi,
>
>I am trying to process a large dataset in R.  The dataset contains the
>following three columns:
>
>key_column - a unique key identifier
>begin_date - the start date of the active period
>end_date - the end date of the active period
>
>
>Example data is here:
>
>key_column,begin_date,end_date
>123456,2013-01-01,2014-01-01
>123456,2013-07-01,2014-07-01
>789102,2012-03-01,2014-03-01
>789102,2015-02-01,2016-02-01
>789102,2015-02-06,2016-02-06
>
>I want to build a condensed table of key_column and begin_date's and
>end_date's.  As you can see in the example data above, some begin and
>end
>date periods overlap with begin_date and end_date pairs for the same
>key_column.  In situations where overlap exists I want to have one
>record
>for the key_column with the min(begin_date) and the max(end_date).
>
>Can anyone help me build the commands to process this data in R?
>
>Thanks,
>Matt