[R] R_closest date
William Dunlap
wdunlap at tibco.com
Thu Sep 6 22:45:37 CEST 2012
Try using ave(), as in
dat1WithMean <- within(dat1, Mean <- ave(OBS_VALUE, PT_ID, DAYS_DIFF, FUN = mean))
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
> Of arun
> Sent: Thursday, September 06, 2012 12:21 PM
> To: Weijia Wang
> Cc: R help
> Subject: Re: [R] R_closest date
>
>
>
> HI,
>
> For your question to split by two variables, try this:
> dat1<-read.table(text="
> ID PT_ID BASE IDX_DT OBS_DATE OBS_VALUE CATEGORY
> DAYS_DIFF LDL_BASE rf
> 118485 9624295 164.2 2006-11-21 2009-02-17 157.6 2 819 2006-11-20 2.5
> 118486 9624295 164.2 2006-11-21 2006-09-30 208.0 2 -52 2006-11-20 2.5
> 118487 9624295 164.2 2006-11-21 2008-04-09 123.8 2 505 2006-11-20 2.5
> 118488 9624295 164.2 2006-11-21 2008-02-26 17.4 1 462 2006-11-20 2.5
> 118489 9624295 164.2 2006-11-21 2008-02-26 139.0 2 462 2006-11-20 2.5
> 118490 9624295 164.2 2006-11-21 2007-07-02 107.2 2 223 2006-11-25 2.5
> 118491 9624295 164.2 2006-11-21 2007-02-27 86.0 1 98 2006-11-25 2.5
> 118492 9624295 164.2 2006-11-21 2008-09-09 131.2 2 658 2006-11-25 2.5
> 118485 9624296 164.2 2006-11-21 2009-02-17 157.6 2 819 2006-11-26 2.5
> 118486 9624296 164.2 2006-11-21 2006-09-30 208.0 2 -52 2006-11-26 2.5
> 118487 9624296 164.2 2006-11-21 2008-04-09 123.8 2 505 2006-11-26 2.5
> 118488 9624296 164.2 2006-11-21 2008-02-26 17.4 1 462 2006-11-27 2.5
> 118489 9624296 164.2 2006-11-21 2008-02-26 139.0 2 462 2006-11-27 2.5
> 118490 9624296 164.2 2006-11-21 2007-07-02 107.2 2 223 2006-11-27 2.5
> 118491 9624296 164.2 2006-11-21 2007-02-27 86.0 1 98 2006-11-27 2.5
> 118492 9624296 164.2 2006-11-21 2008-09-09 131.2 2 658 2006-11-27 2.5
> ",sep="",header=TRUE)
>
> dat2<-split(dat1,list(dat1$LDL_BASE,dat1$PT_ID))
> dat3<-list()
> for(i in seq_along(dat2)){
> dat3[[i]]<-list()
> dat3[[i]]<-ddply(dat2[[i]],.(DAYS_DIFF),summarize,Mean=mean(OBS_VALUE))
> }
> dat3
> do.call(rbind,dat3)
> # DAYS_DIFF Mean
> #1 -52 208.0
> #2 462 78.2
> #3 505 123.8
> #4 819 157.6
> #5 98 86.0
> #6 223 107.2
> #7 658 131.2
> #8 -52 208.0
> #9 505 123.8
> #10 819 157.6
> #11 98 86.0
> #12 223 107.2
> #13 462 78.2
> #14 658 131.2
>
>
> #Not sure whether this will work or not in your huge dataset. May be you can try lapply()
> also.
>
> A.K.
> ________________________________
> From: Weijia Wang <wwang.nyu at gmail.com>
> To: arun <smartpink111 at yahoo.com>
> Sent: Thursday, September 6, 2012 12:21 PM
> Subject: Re: [R] R_closest date
>
>
> Hi, Arun
>
> Do you have idea about good package that split HUGE dataframe by two variables?
>
> I was trying to use 'ddply' to calculate a mean of LDL-C values which had same date for
> every patient.
>
> Therefore, I need to break down my dataframe, first by patient ID, then by the date of
> the LDL-C, and finally calculate the mean, if there are multiple LDL-C on a same day.
>
> The example is:
>
> PT_ID BASE IDX_DT OBS_DATE OBS_VALUE CATEGORY
> DAYS_DIFF LDL_BASE rf
> 118485 9624295 164.2 2006-11-21 2009-02-17 157.6 2 819 days 2006-11-20 2.5
> 118486 9624295 164.2 2006-11-21 2006-09-30 208.0 2 -52 days 2006-11-20 2.5
> 118487 9624295 164.2 2006-11-21 2008-04-09 123.8 2 505 days 2006-11-20 2.5
> 118488 9624295 164.2 2006-11-21 2008-02-26 17.4 1 462 days 2006-11-20 2.5
> 118489 9624295 164.2 2006-11-21 2008-02-26 139.0 2 462 days 2006-11-20 2.5
> 118490 9624295 164.2 2006-11-21 2007-07-02 107.2 2 223 days 2006-11-20 2.5
> 118491 9624295 164.2 2006-11-21 2007-02-27 86.0 1 98 days 2006-11-20 2.5
> 118492 9624295 164.2 2006-11-21 2008-09-09 131.2 2 658 days 2006-11-20 2.5
> REDUCTION GOAL FAILURE
> 118485 0.04019488 NOT AT GOAL FAIL
> 118486 -0.26674787 PRE NOT AT GOAL NOT FAIL
> 118487 0.24604141 AT GOAL NOT FAIL
> 118488 0.89403167 AT GOAL NOT FAIL
> 118489 0.15347138 NOT AT GOAL FAIL
> 118490 0.34713764 AT GOAL NOT FAIL
> 118491 0.47624848 AT GOAL NOT FAIL
> 118492 0.20097442 NOT AT GOAL NOT FAIL
>
> So, this patient has two LDL-C readings on '462 days', therefore, I want to get a mean of
> these 17.4 and 139.0.
>
> 'ddply' did give me a mean when running on a test dataframe, but when I used it on my
> dataframe with 200,000ish observations, the computer run for like 5 hours and return
> error. Do you have idea about other good function, that focuses on split and apply
> function, and rbind?
>
> Best
> Weijia
>
>
> On Mon, Sep 3, 2012 at 2:08 AM, wwang.nyu <wwang.nyu at gmail.com> wrote:
>
> That is actually a great idea, thanks again!
> >
> >Weijia Wang
> >
> >
> >On Sep 2, 2012, at 12:12 PM, arun <smartpink111 at yahoo.com> wrote:
> >
> >> Hi,
> >> No problem.
> >>
> >> If you use join() instead of merge(), the original order of columns may not get altered.
> >>
> >> dat3<-aggregate(DAYS_DIFF~PT_ID,data=dat1,min)
> >> library(plyr)
> >> join(dat1,dat3,type="inner")
> >> #Joining by: PT_ID, DAYS_DIFF
> >> # PT_ID IDX_DT OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
> >> #1 4549 2002-08-21 2002-08-20 -1 183 2
> >> #2 4839 2006-11-28 2006-11-28 0 179 2
> >> A.K.
> >>
> >>
> >>
> >>
> >>
> >>
> >> ________________________________
> >> From: Weijia Wang <wwang.nyu at gmail.com>
> >> To: arun <smartpink111 at yahoo.com>
> >> Sent: Saturday, September 1, 2012 5:11 PM
> >> Subject: Re: [R] R_closest date
> >>
> >>
> >> Thank you Arun, for your help again.
> >>
> >> Best
> >> ______________________________
> >> WANG WEIJIA
> >> Graudate Research and Teaching Assistant
> >> Department of Environmental Medicine
> >> New York University, School of Medicine
> >> wwang.nyu at gmail.com
> >>
> >>
> >>
> >>
> >> On Sep 1, 2012, at 5:04 PM, arun <smartpink111 at yahoo.com> wrote:
> >>
> >> Hi,
> >>> Try this:
> >>> dat1 <- read.table(text="
> >>> PT_ID IDX_DT OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
> >>> 13 4549 2002-08-21 2002-08-20 -1 183 2
> >>> 14 4549 2002-08-21 2002-11-14 85 91 1
> >>> 15 4549 2002-08-21 2003-02-18 181 89 1
> >>> 16 4549 2002-08-21 2003-05-15 267 109 2
> >>> 17 4549 2002-08-21 2003-12-16 482 96 1
> >>> 128 4839 2006-11-28 2006-11-28 0 179 2
> >>> ", header=TRUE)
> >>> dat3<-aggregate(DAYS_DIFF~PT_ID,data=dat1,min)
> >>> merge(dat1,dat3)
> >>> # PT_ID DAYS_DIFF IDX_DT OBS_DATE OBS_VALUE CATEGORY
> >>> #1 4549 -1 2002-08-21 2002-08-20 183 2
> >>> #2 4839 0 2006-11-28 2006-11-28 179 2
> >>>
> >>> #or,
> >>> dat2<- tapply(dat1$DAYS_DIFF,dat1$PT_ID,min)
> >>> dat4<-data.frame(PT_ID=row.names(data.frame(dat2)),DAYS_DIFF=dat2)
> >>> row.names(dat4)<-1:nrow(dat4)
> >>> merge(dat1,dat4)
> >>> # PT_ID DAYS_DIFF IDX_DT OBS_DATE OBS_VALUE CATEGORY
> >>> #1 4549 -1 2002-08-21 2002-08-20 183 2
> >>> #2 4839 0 2006-11-28 2006-11-28 179 2
> >>> A.K.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> ----- Original Message -----
> >>> From: WANG WEIJIA <wwang.nyu at gmail.com>
> >>> To: "r-help at R-project.org" <r-help at r-project.org>
> >>> Cc:
> >>> Sent: Saturday, September 1, 2012 1:10 PM
> >>> Subject: [R] R_closest date
> >>>
> >>> Hi,
> >>>
> >>> I have encountered an issue about finding a date closest to another date
> >>>
> >>> So this is how the data frame looks like:
> >>>
> >>> PT_ID IDX_DT OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
> >>> 13 4549 2002-08-21 2002-08-20 -1 183 2
> >>> 14 4549 2002-08-21 2002-11-14 85 91 1
> >>> 15 4549 2002-08-21 2003-02-18 181 89 1
> >>> 16 4549 2002-08-21 2003-05-15 267 109 2
> >>> 17 4549 2002-08-21 2003-12-16 482 96 1
> >>> 128 4839 2006-11-28 2006-11-28 0 179 2
> >>>
> >>> I need to find, the single observation, which has the closest date of 'OBS_DATE' to
> 'IDX_DT'.
> >>>
> >>> For example, for 'PT_ID' of 4549, I need row 13, of which the OBS_DATE is just one
> day away from IDX_DT.
> >>>
> >>> I was thinking about using abs(), and I got this:
> >>>
> >>> baseline<- function(x){
> >>> +
> >>> + #remove all uncessary variables
> >>> + baseline<- x[,c("PT_ID","DAYS_DIFF")]
> >>> +
> >>> + #get a list of every unique ID
> >>> + uniqueID <- unique(baseline$PT_ID)
> >>> +
> >>> + #make a vector that will contain the smallest DAYS_DIFF
> >>> + first <- rep(-99,length(uniqueID))
> >>> +
> >>> + i = 1
> >>> + #loop through each unique ID
> >>> + for (PT_ID in uniqueID){
> >>> +
> >>> + #for each iteration get the smallest DAYS_DIFF for that ID
> >>> + first[i] <- min(baseline[which(baseline$PT_ID==PT_ID),abs(baseline$DAYS_DIFF)])
> >>> +
> >>> + #up the iteration counter
> >>> + i = i + 1
> >>> +
> >>> + }
> >>> + #make a data frame with the lowest DAYS_DIFF and ID
> >>> + newdata <- data.frame(uniqueID,first)
> >>> + names(newdata) <- c("PT_ID","DAYS_DIFF")
> >>> +
> >>> + #return the data frame containing the lowest GPI for each ID
> >>> + return(newdata)
> >>> + }
> >>>
> >>> ldl.b<-baseline(ldl) #get all baseline ldl patient ID, total 11368 obs, all unique#
> >>>> Error in `[.data.frame`(baseline, which(baseline$PT_ID == PT_ID),
> abs(baseline$DAYS_DIFF)) :
> >>> undefined columns selected
> >>>
> >>> Can anyone help me in figuring out how to get the minimum value of the absolute
> value of DAYS_DIFF for unique ID?
> >>>
> >>> Thanks a lot
> >>> [[alternative HTML version deleted]]
> >>>
> >>> ______________________________________________
> >>> R-help at r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >>> and provide commented, minimal, self-contained, reproducible code.
> >>>
> >>>
> >
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
More information about the R-help
mailing list