[R] R_closest date

Thu Sep 6 22:45:37 CEST 2012

Try using ave(), as in
     dat1WithMean <- within(dat1, Mean <- ave(OBS_VALUE, PT_ID, DAYS_DIFF, FUN = mean))

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com


> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
> Of arun
> Sent: Thursday, September 06, 2012 12:21 PM
> To: Weijia Wang
> Cc: R help
> Subject: Re: [R] R_closest date
> 
> 
> 
> HI,
> 
> For your question to split by two variables, try this:
> dat1<-read.table(text="
> ID         PT_ID  BASE     IDX_DT   OBS_DATE OBS_VALUE CATEGORY
> DAYS_DIFF   LDL_BASE  rf
> 118485 9624295 164.2 2006-11-21 2009-02-17     157.6        2  819  2006-11-20 2.5
> 118486 9624295 164.2 2006-11-21 2006-09-30     208.0        2  -52  2006-11-20 2.5
> 118487 9624295 164.2 2006-11-21 2008-04-09     123.8        2  505  2006-11-20 2.5
> 118488 9624295 164.2 2006-11-21 2008-02-26      17.4        1  462  2006-11-20 2.5
> 118489 9624295 164.2 2006-11-21 2008-02-26     139.0        2  462  2006-11-20 2.5
> 118490 9624295 164.2 2006-11-21 2007-07-02     107.2        2  223  2006-11-25 2.5
> 118491 9624295 164.2 2006-11-21 2007-02-27      86.0        1   98  2006-11-25 2.5
> 118492 9624295 164.2 2006-11-21 2008-09-09     131.2        2  658  2006-11-25 2.5
> 118485 9624296 164.2 2006-11-21 2009-02-17     157.6        2  819  2006-11-26 2.5
> 118486 9624296 164.2 2006-11-21 2006-09-30     208.0        2  -52  2006-11-26 2.5
> 118487 9624296 164.2 2006-11-21 2008-04-09     123.8        2  505  2006-11-26 2.5
> 118488 9624296 164.2 2006-11-21 2008-02-26      17.4        1  462  2006-11-27 2.5
> 118489 9624296 164.2 2006-11-21 2008-02-26     139.0        2  462  2006-11-27 2.5
> 118490 9624296 164.2 2006-11-21 2007-07-02     107.2        2  223  2006-11-27 2.5
> 118491 9624296 164.2 2006-11-21 2007-02-27      86.0        1   98  2006-11-27 2.5
> 118492 9624296 164.2 2006-11-21 2008-09-09     131.2        2  658  2006-11-27 2.5
>  ",sep="",header=TRUE)
> 
> dat2<-split(dat1,list(dat1$LDL_BASE,dat1$PT_ID))
> dat3<-list()
>  for(i in seq_along(dat2)){
>  dat3[[i]]<-list()
>  dat3[[i]]<-ddply(dat2[[i]],.(DAYS_DIFF),summarize,Mean=mean(OBS_VALUE))
>  }
> dat3
> do.call(rbind,dat3)
> #   DAYS_DIFF  Mean
> #1        -52 208.0
> #2        462  78.2
> #3        505 123.8
> #4        819 157.6
> #5         98  86.0
> #6        223 107.2
> #7        658 131.2
> #8        -52 208.0
> #9        505 123.8
> #10       819 157.6
> #11        98  86.0
> #12       223 107.2
> #13       462  78.2
> #14       658 131.2
> 
> 
> #Not sure whether this will work or not in your huge dataset.  May be you can try lapply()
> also.
> 
> A.K.
> ________________________________
> From: Weijia Wang <wwang.nyu at gmail.com>
> To: arun <smartpink111 at yahoo.com>
> Sent: Thursday, September 6, 2012 12:21 PM
> Subject: Re: [R] R_closest date
> 
> 
> Hi, Arun
> 
> Do you have idea about good package that split HUGE dataframe by two variables?
> 
> I was trying to use 'ddply' to calculate a mean of LDL-C values which had same date for
> every patient.
> 
> Therefore, I need to break down my dataframe, first by patient ID, then by the date of
> the LDL-C, and finally calculate the mean, if there are multiple LDL-C on a same day.
> 
> The example is:
> 
>          PT_ID  BASE     IDX_DT   OBS_DATE OBS_VALUE CATEGORY
> DAYS_DIFF   LDL_BASE  rf
> 118485 9624295 164.2 2006-11-21 2009-02-17     157.6        2  819 days 2006-11-20 2.5
> 118486 9624295 164.2 2006-11-21 2006-09-30     208.0        2  -52 days 2006-11-20 2.5
> 118487 9624295 164.2 2006-11-21 2008-04-09     123.8        2  505 days 2006-11-20 2.5
> 118488 9624295 164.2 2006-11-21 2008-02-26      17.4        1  462 days 2006-11-20 2.5
> 118489 9624295 164.2 2006-11-21 2008-02-26     139.0        2  462 days 2006-11-20 2.5
> 118490 9624295 164.2 2006-11-21 2007-07-02     107.2        2  223 days 2006-11-20 2.5
> 118491 9624295 164.2 2006-11-21 2007-02-27      86.0        1   98 days 2006-11-20 2.5
> 118492 9624295 164.2 2006-11-21 2008-09-09     131.2        2  658 days 2006-11-20 2.5
>          REDUCTION            GOAL  FAILURE
> 118485  0.04019488     NOT AT GOAL     FAIL
> 118486 -0.26674787 PRE NOT AT GOAL NOT FAIL
> 118487  0.24604141         AT GOAL NOT FAIL
> 118488  0.89403167         AT GOAL NOT FAIL
> 118489  0.15347138     NOT AT GOAL     FAIL
> 118490  0.34713764         AT GOAL NOT FAIL
> 118491  0.47624848         AT GOAL NOT FAIL
> 118492  0.20097442     NOT AT GOAL NOT FAIL
> 
> So, this patient has two LDL-C readings on '462 days', therefore, I want to get a mean of
> these 17.4 and 139.0.
> 
> 'ddply' did give me a mean when running on a test dataframe, but when I used it on my
> dataframe with 200,000ish observations, the computer run for like 5 hours and return
> error. Do you have idea about other good function, that focuses on split and apply
> function, and rbind?
> 
> Best
> Weijia
> 
> 
> On Mon, Sep 3, 2012 at 2:08 AM, wwang.nyu <wwang.nyu at gmail.com> wrote:
> 
> That is actually a great idea, thanks again!
> >
> >Weijia Wang
> >
> >
> >On Sep 2, 2012, at 12:12 PM, arun <smartpink111 at yahoo.com> wrote:
> >
> >> Hi,
> >> No problem.
> >>
> >> If you use join() instead of merge(), the original order of columns may not get altered.
> >>
> >> dat3<-aggregate(DAYS_DIFF~PT_ID,data=dat1,min)
> >> library(plyr)
> >>  join(dat1,dat3,type="inner")
> >> #Joining by: PT_ID, DAYS_DIFF
> >>  # PT_ID     IDX_DT   OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
> >> #1  4549 2002-08-21 2002-08-20        -1       183        2
> >> #2  4839 2006-11-28 2006-11-28         0       179        2
> >> A.K.
> >>
> >>
> >>
> >>
> >>
> >>
> >> ________________________________
> >> From: Weijia Wang <wwang.nyu at gmail.com>
> >> To: arun <smartpink111 at yahoo.com>
> >> Sent: Saturday, September 1, 2012 5:11 PM
> >> Subject: Re: [R] R_closest date
> >>
> >>
> >> Thank you Arun, for your help again.
> >>
> >> Best
> >> ______________________________
> >> WANG WEIJIA
> >> Graudate Research and Teaching Assistant
> >> Department of Environmental Medicine
> >> New York University, School of Medicine
> >> wwang.nyu at gmail.com
> >>
> >>
> >>
> >>
> >> On Sep 1, 2012, at 5:04 PM, arun <smartpink111 at yahoo.com> wrote:
> >>
> >> Hi,
> >>> Try this:
> >>> dat1 <- read.table(text="
> >>>   PT_ID    IDX_DT  OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
> >>> 13  4549 2002-08-21 2002-08-20        -1      183        2
> >>> 14  4549 2002-08-21 2002-11-14        85        91        1
> >>> 15  4549 2002-08-21 2003-02-18      181        89        1
> >>> 16  4549 2002-08-21 2003-05-15      267      109        2
> >>> 17  4549 2002-08-21 2003-12-16      482        96        1
> >>> 128  4839 2006-11-28 2006-11-28        0      179        2
> >>> ", header=TRUE)
> >>> dat3<-aggregate(DAYS_DIFF~PT_ID,data=dat1,min)
> >>> merge(dat1,dat3)
> >>> #  PT_ID DAYS_DIFF     IDX_DT   OBS_DATE OBS_VALUE CATEGORY
> >>> #1  4549        -1 2002-08-21 2002-08-20       183        2
> >>> #2  4839         0 2006-11-28 2006-11-28       179        2
> >>>
> >>> #or,
> >>> dat2<- tapply(dat1$DAYS_DIFF,dat1$PT_ID,min)
> >>> dat4<-data.frame(PT_ID=row.names(data.frame(dat2)),DAYS_DIFF=dat2)
> >>>  row.names(dat4)<-1:nrow(dat4)
> >>> merge(dat1,dat4)
> >>> #  PT_ID DAYS_DIFF     IDX_DT   OBS_DATE OBS_VALUE CATEGORY
> >>> #1  4549        -1 2002-08-21 2002-08-20       183        2
> >>> #2  4839         0 2006-11-28 2006-11-28       179        2
> >>> A.K.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> ----- Original Message -----
> >>> From: WANG WEIJIA <wwang.nyu at gmail.com>
> >>> To: "r-help at R-project.org" <r-help at r-project.org>
> >>> Cc:
> >>> Sent: Saturday, September 1, 2012 1:10 PM
> >>> Subject: [R] R_closest date
> >>>
> >>> Hi,
> >>>
> >>> I have encountered an issue about finding a date closest to another date
> >>>
> >>> So this is how the data frame looks like:
> >>>
> >>>     PT_ID     IDX_DT   OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
> >>> 13   4549 2002-08-21 2002-08-20        -1       183        2
> >>> 14   4549 2002-08-21 2002-11-14        85        91        1
> >>> 15   4549 2002-08-21 2003-02-18       181        89        1
> >>> 16   4549 2002-08-21 2003-05-15       267       109        2
> >>> 17   4549 2002-08-21 2003-12-16       482        96        1
> >>> 128  4839 2006-11-28 2006-11-28         0       179        2
> >>>
> >>> I need to find, the single observation, which has the closest date of 'OBS_DATE' to
> 'IDX_DT'.
> >>>
> >>> For example, for 'PT_ID' of 4549, I need row 13, of which the OBS_DATE is just one
> day away from IDX_DT.
> >>>
> >>> I was thinking about using abs(), and I got this:
> >>>
> >>> baseline<- function(x){
> >>> +
> >>> +  #remove all uncessary variables
> >>> +  baseline<- x[,c("PT_ID","DAYS_DIFF")]
> >>> +
> >>> +  #get a list of every unique ID
> >>> +  uniqueID <- unique(baseline$PT_ID)
> >>> +
> >>> +  #make a vector that will contain the smallest DAYS_DIFF
> >>> +  first <- rep(-99,length(uniqueID))
> >>> +
> >>> +  i = 1
> >>> +  #loop through each unique ID
> >>> +  for (PT_ID in uniqueID){
> >>> +
> >>> +  #for each iteration get the smallest DAYS_DIFF for that ID
> >>> +  first[i] <- min(baseline[which(baseline$PT_ID==PT_ID),abs(baseline$DAYS_DIFF)])
> >>> +
> >>> +  #up the iteration counter
> >>> +  i = i + 1
> >>> +
> >>> +  }
> >>> +  #make a data frame with the lowest DAYS_DIFF and ID
> >>> +  newdata <- data.frame(uniqueID,first)
> >>> +  names(newdata) <- c("PT_ID","DAYS_DIFF")
> >>> +
> >>> +  #return the data frame containing the lowest GPI for each ID
> >>> +  return(newdata)
> >>> +  }
> >>>
> >>> ldl.b<-baseline(ldl) #get all baseline ldl patient ID, total 11368 obs, all unique#
> >>>> Error in `[.data.frame`(baseline, which(baseline$PT_ID == PT_ID),
> abs(baseline$DAYS_DIFF)) :
> >>>   undefined columns selected
> >>>
> >>> Can anyone help me in figuring out how to get the minimum value of the absolute
> value of DAYS_DIFF for unique ID?
> >>>
> >>> Thanks a lot
> >>>     [[alternative HTML version deleted]]
> >>>
> >>> ______________________________________________
> >>> R-help at r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >>> and provide commented, minimal, self-contained, reproducible code.
> >>>
> >>>
> >
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.