[R] R_closest date

Thu Sep 6 23:32:01 CEST 2012

Hi Weija,

You can also try this instead of the loop solution in my previous reply.

dat1<-read.table(text="
ID         PT_ID  BASE     IDX_DT   OBS_DATE OBS_VALUE CATEGORY DAYS_DIFF   LDL_BASE  rf
118485 9624295 164.2 2006-11-21 2009-02-17     157.6        2  819  2006-11-20 2.5
118486 9624295 164.2 2006-11-21 2006-09-30     208.0        2  -52  2006-11-20 2.5
118487 9624295 164.2 2006-11-21 2008-04-09     123.8        2  505  2006-11-20 2.5
118488 9624295 164.2 2006-11-21 2008-02-26      17.4        1  462  2006-11-20 2.5
118489 9624295 164.2 2006-11-21 2008-02-26     139.0        2  462  2006-11-20 2.5
118490 9624295 164.2 2006-11-21 2007-07-02     107.2        2  223  2006-11-25 2.5
118491 9624295 164.2 2006-11-21 2007-02-27      86.0        1   98  2006-11-25 2.5
118492 9624295 164.2 2006-11-21 2008-09-09     131.2        2  658  2006-11-25 2.5
118485 9624296 164.2 2006-11-21 2009-02-17     157.6        2  819  2006-11-26 2.5
118486 9624296 164.2 2006-11-21 2006-09-30     208.0        2  -52  2006-11-26 2.5
118487 9624296 164.2 2006-11-21 2008-04-09     123.8        2  505  2006-11-26 2.5
118488 9624296 164.2 2006-11-21 2008-02-26      17.4        1  462  2006-11-27 2.5
118489 9624296 164.2 2006-11-21 2008-02-26     139.0        2  462  2006-11-27 2.5
118490 9624296 164.2 2006-11-21 2007-07-02     107.2        2  223  2006-11-27 2.5
118491 9624296 164.2 2006-11-21 2007-02-27      86.0        1   98  2006-11-27 2.5
118492 9624296 164.2 2006-11-21 2008-09-09     131.2        2  658  2006-11-27 2.5
 ",sep="",header=TRUE)

dat2<-split(dat1,list(dat1$LDL_BASE,dat1$PT_ID))
dat3<-data.frame(do.call(rbind,lapply(dat2,`[`,c(6,8))))
dat4<-data.frame(LDL_Date.PTID=row.names(dat3),dat3)
row.names(dat4)<-1:nrow(dat4)
dat4$LDL_Date.PTID<-gsub("(.*..*)\\.\\d+","\\1",dat4$LDL_Date.PTID)
aggregate(dat4$OBS_VALUE,list(dat4$DAYS_DIFF,dat4$LDL_Date.PTID),mean)
Group.1            Group.2     x
#1      -52 2006-11-20.9624295 208.0
#2      462 2006-11-20.9624295  78.2
#3      505 2006-11-20.9624295 123.8
#4      819 2006-11-20.9624295 157.6
#5       98 2006-11-25.9624295  86.0
#6      223 2006-11-25.9624295 107.2
#7      658 2006-11-25.9624295 131.2
#8      -52 2006-11-26.9624296 208.0
#9      505 2006-11-26.9624296 123.8
#10     819 2006-11-26.9624296 157.6
#11      98 2006-11-27.9624296  86.0
#12     223 2006-11-27.9624296 107.2
#13     462 2006-11-27.9624296  78.2
#14     658 2006-11-27.9624296 131.2
A.K.

________________________________
From: Weijia Wang <wwang.nyu at gmail.com>
To: arun <smartpink111 at yahoo.com> 
Sent: Thursday, September 6, 2012 12:21 PM
Subject: Re: [R] R_closest date

Hi, Arun

Do you have idea about good package that split HUGE dataframe by two variables?

I was trying to use 'ddply' to calculate a mean of LDL-C values which had same date for every patient. 

Therefore, I need to break down my dataframe, first by patient ID, then by the date of the LDL-C, and finally calculate the mean, if there are multiple LDL-C on a same day.

The example is:

         PT_ID  BASE     IDX_DT   OBS_DATE OBS_VALUE CATEGORY DAYS_DIFF   LDL_BASE  rf
118485 9624295 164.2 2006-11-21 2009-02-17     157.6        2  819 days 2006-11-20 2.5
118486 9624295 164.2 2006-11-21 2006-09-30     208.0        2  -52 days 2006-11-20 2.5
118487 9624295 164.2 2006-11-21 2008-04-09     123.8        2  505 days 2006-11-20 2.5
118488 9624295 164.2 2006-11-21 2008-02-26      17.4        1  462 days 2006-11-20 2.5
118489 9624295 164.2 2006-11-21 2008-02-26     139.0        2  462 days 2006-11-20 2.5
118490 9624295 164.2 2006-11-21 2007-07-02     107.2        2  223 days 2006-11-20 2.5
118491 9624295 164.2 2006-11-21 2007-02-27      86.0        1   98 days 2006-11-20 2.5
118492 9624295 164.2 2006-11-21 2008-09-09     131.2        2  658 days 2006-11-20 2.5
         REDUCTION            GOAL  FAILURE
118485  0.04019488     NOT AT GOAL     FAIL
118486 -0.26674787 PRE NOT AT GOAL NOT FAIL
118487  0.24604141         AT GOAL NOT FAIL
118488  0.89403167         AT GOAL NOT FAIL
118489  0.15347138     NOT AT GOAL     FAIL
118490  0.34713764         AT GOAL NOT FAIL
118491  0.47624848         AT GOAL NOT FAIL
118492  0.20097442     NOT AT GOAL NOT FAIL

So, this patient has two LDL-C readings on '462 days', therefore, I want to get a mean of these 17.4 and 139.0.

'ddply' did give me a mean when running on a test dataframe, but when I used it on my dataframe with 200,000ish observations, the computer run for like 5 hours and return error. Do you have idea about other good function, that focuses on split and apply function, and rbind?

Best
Weijia

On Mon, Sep 3, 2012 at 2:08 AM, wwang.nyu <wwang.nyu at gmail.com> wrote:

That is actually a great idea, thanks again!
>
>Weijia Wang
>
>
>On Sep 2, 2012, at 12:12 PM, arun <smartpink111 at yahoo.com> wrote:
>
>> Hi,
>> No problem.
>>
>> If you use join() instead of merge(), the original order of columns may not get altered.
>>
>> dat3<-aggregate(DAYS_DIFF~PT_ID,data=dat1,min)
>> library(plyr)
>>  join(dat1,dat3,type="inner")
>> #Joining by: PT_ID, DAYS_DIFF
>>  # PT_ID     IDX_DT   OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
>> #1  4549 2002-08-21 2002-08-20        -1       183        2
>> #2  4839 2006-11-28 2006-11-28         0       179        2
>> A.K.
>>
>>
>>
>>
>>
>>
>> ________________________________
>> From: Weijia Wang <wwang.nyu at gmail.com>
>> To: arun <smartpink111 at yahoo.com>
>> Sent: Saturday, September 1, 2012 5:11 PM
>> Subject: Re: [R] R_closest date
>>
>>
>> Thank you Arun, for your help again.
>>
>> Best
>> ______________________________
>> WANG WEIJIA
>> Graudate Research and Teaching Assistant
>> Department of Environmental Medicine
>> New York University, School of Medicine
>> wwang.nyu at gmail.com
>>
>>
>>
>>
>> On Sep 1, 2012, at 5:04 PM, arun <smartpink111 at yahoo.com> wrote:
>>
>> Hi,
>>> Try this:
>>> dat1 <- read.table(text="
>>>   PT_ID    IDX_DT  OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
>>> 13  4549 2002-08-21 2002-08-20        -1      183        2
>>> 14  4549 2002-08-21 2002-11-14        85        91        1
>>> 15  4549 2002-08-21 2003-02-18      181        89        1
>>> 16  4549 2002-08-21 2003-05-15      267      109        2
>>> 17  4549 2002-08-21 2003-12-16      482        96        1
>>> 128  4839 2006-11-28 2006-11-28        0      179        2
>>> ", header=TRUE)
>>> dat3<-aggregate(DAYS_DIFF~PT_ID,data=dat1,min)
>>> merge(dat1,dat3)
>>> #  PT_ID DAYS_DIFF     IDX_DT   OBS_DATE OBS_VALUE CATEGORY
>>> #1  4549        -1 2002-08-21 2002-08-20       183        2
>>> #2  4839         0 2006-11-28 2006-11-28       179        2
>>>
>>> #or,
>>> dat2<- tapply(dat1$DAYS_DIFF,dat1$PT_ID,min)
>>> dat4<-data.frame(PT_ID=row.names(data.frame(dat2)),DAYS_DIFF=dat2)
>>>  row.names(dat4)<-1:nrow(dat4)
>>> merge(dat1,dat4)
>>> #  PT_ID DAYS_DIFF     IDX_DT   OBS_DATE OBS_VALUE CATEGORY
>>> #1  4549        -1 2002-08-21 2002-08-20       183        2
>>> #2  4839         0 2006-11-28 2006-11-28       179        2
>>> A.K.
>>>
>>>
>>>
>>>
>>>
>>> ----- Original Message -----
>>> From: WANG WEIJIA <wwang.nyu at gmail.com>
>>> To: "r-help at R-project.org" <r-help at r-project.org>
>>> Cc:
>>> Sent: Saturday, September 1, 2012 1:10 PM
>>> Subject: [R] R_closest date
>>>
>>> Hi,
>>>
>>> I have encountered an issue about finding a date closest to another date
>>>
>>> So this is how the data frame looks like:
>>>
>>>     PT_ID     IDX_DT   OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
>>> 13   4549 2002-08-21 2002-08-20        -1       183        2
>>> 14   4549 2002-08-21 2002-11-14        85        91        1
>>> 15   4549 2002-08-21 2003-02-18       181        89        1
>>> 16   4549 2002-08-21 2003-05-15       267       109        2
>>> 17   4549 2002-08-21 2003-12-16       482        96        1
>>> 128  4839 2006-11-28 2006-11-28         0       179        2
>>>
>>> I need to find, the single observation, which has the closest date of 'OBS_DATE' to 'IDX_DT'.
>>>
>>> For example, for 'PT_ID' of 4549, I need row 13, of which the OBS_DATE is just one day away from IDX_DT.
>>>
>>> I was thinking about using abs(), and I got this:
>>>
>>> baseline<- function(x){
>>> +
>>> +  #remove all uncessary variables
>>> +  baseline<- x[,c("PT_ID","DAYS_DIFF")]
>>> +
>>> +  #get a list of every unique ID
>>> +  uniqueID <- unique(baseline$PT_ID)
>>> +
>>> +  #make a vector that will contain the smallest DAYS_DIFF
>>> +  first <- rep(-99,length(uniqueID))
>>> +
>>> +  i = 1
>>> +  #loop through each unique ID
>>> +  for (PT_ID in uniqueID){
>>> +
>>> +  #for each iteration get the smallest DAYS_DIFF for that ID
>>> +  first[i] <- min(baseline[which(baseline$PT_ID==PT_ID),abs(baseline$DAYS_DIFF)])
>>> +
>>> +  #up the iteration counter
>>> +  i = i + 1
>>> +
>>> +  }
>>> +  #make a data frame with the lowest DAYS_DIFF and ID
>>> +  newdata <- data.frame(uniqueID,first)
>>> +  names(newdata) <- c("PT_ID","DAYS_DIFF")
>>> +
>>> +  #return the data frame containing the lowest GPI for each ID
>>> +  return(newdata)
>>> +  }
>>>
>>> ldl.b<-baseline(ldl) #get all baseline ldl patient ID, total 11368 obs, all unique#
>>>> Error in `[.data.frame`(baseline, which(baseline$PT_ID == PT_ID), abs(baseline$DAYS_DIFF)) :
>>>   undefined columns selected
>>>
>>> Can anyone help me in figuring out how to get the minimum value of the absolute value of DAYS_DIFF for unique ID?
>>>
>>> Thanks a lot
>>>     [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>