[R] Processing key_column, begin_date, end_date in R
Jeff Newmiller
jdnewmil at dcn.davis.CA.us
Thu Feb 26 15:30:56 CET 2015
Here is another way. Have not tested for large scale efficiency, but if you convert dta to a data.table that might improve things.
library(dplyr)
dta <- read.csv( text=
"key_column,begin_date,end_date
123456,2013-01-01,2014-01-01
123456,2013-07-01,2014-07-01
789102,2012-03-01,2014-03-01
789102,2015-02-01,2016-02-01
789102,2015-02-06,2016-02-06
789102,2015-02-28,2015-03-31
789102,2015-04-30,2015-05-31
", as.is=TRUE)
( dta
%>% mutate( begin_date = as.Date( begin_date ),
end_date = as.Date( end_date ) )
%>% arrange( key_column, begin_date )
) -> dta
mkgp <- function( begin_date, cend ) {
ix <- c( TRUE, cend[ -length( begin_date ) ] < begin_date[ -1 ] )
cumsum( ix )
}
result <- ( dta
%>% group_by( key_column )
%>% mutate( cend = as.Date( cummax( as.numeric( end_date ) )
, origin="1970-01-01" )
, gp = mkgp( begin_date, cend )
)
%>% ungroup
%>% group_by( key_column, gp )
%>% summarise( begin_date = begin_date[ 1 ]
, end_date = cend[ length( cend ) ]
)
%>% ungroup
%>% select( -gp )
%>% as.data.frame
)
---------------------------------------------------------------------------
Jeff Newmiller The ..... ..... Go Live...
DCN:<jdnewmil at dcn.davis.ca.us> Basics: ##.#. ##.#. Live Go...
Live: OO#.. Dead: OO#.. Playing
Research Engineer (Solar/Batteries O.O#. #.O#. with
/Software/Embedded Controllers) .OO#. .OO#. rocks...1k
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity.
On February 25, 2015 1:18:58 PM PST, Matt Gross <grossm at gmail.com> wrote:
>Hi,
>
>I am trying to process a large dataset in R. The dataset contains the
>following three columns:
>
>key_column - a unique key identifier
>begin_date - the start date of the active period
>end_date - the end date of the active period
>
>
>Example data is here:
>
>key_column,begin_date,end_date
>123456,2013-01-01,2014-01-01
>123456,2013-07-01,2014-07-01
>789102,2012-03-01,2014-03-01
>789102,2015-02-01,2016-02-01
>789102,2015-02-06,2016-02-06
>
>I want to build a condensed table of key_column and begin_date's and
>end_date's. As you can see in the example data above, some begin and
>end
>date periods overlap with begin_date and end_date pairs for the same
>key_column. In situations where overlap exists I want to have one
>record
>for the key_column with the min(begin_date) and the max(end_date).
>
>Can anyone help me build the commands to process this data in R?
>
>Thanks,
>Matt
More information about the R-help
mailing list