[R] R_closest date
arun
smartpink111 at yahoo.com
Thu Sep 6 23:32:01 CEST 2012
Hi Weija,
You can also try this instead of the loop solution in my previous reply.
dat1<-read.table(text="
ID PT_ID BASE IDX_DT OBS_DATE OBS_VALUE CATEGORY DAYS_DIFF LDL_BASE rf
118485 9624295 164.2 2006-11-21 2009-02-17 157.6 2 819 2006-11-20 2.5
118486 9624295 164.2 2006-11-21 2006-09-30 208.0 2 -52 2006-11-20 2.5
118487 9624295 164.2 2006-11-21 2008-04-09 123.8 2 505 2006-11-20 2.5
118488 9624295 164.2 2006-11-21 2008-02-26 17.4 1 462 2006-11-20 2.5
118489 9624295 164.2 2006-11-21 2008-02-26 139.0 2 462 2006-11-20 2.5
118490 9624295 164.2 2006-11-21 2007-07-02 107.2 2 223 2006-11-25 2.5
118491 9624295 164.2 2006-11-21 2007-02-27 86.0 1 98 2006-11-25 2.5
118492 9624295 164.2 2006-11-21 2008-09-09 131.2 2 658 2006-11-25 2.5
118485 9624296 164.2 2006-11-21 2009-02-17 157.6 2 819 2006-11-26 2.5
118486 9624296 164.2 2006-11-21 2006-09-30 208.0 2 -52 2006-11-26 2.5
118487 9624296 164.2 2006-11-21 2008-04-09 123.8 2 505 2006-11-26 2.5
118488 9624296 164.2 2006-11-21 2008-02-26 17.4 1 462 2006-11-27 2.5
118489 9624296 164.2 2006-11-21 2008-02-26 139.0 2 462 2006-11-27 2.5
118490 9624296 164.2 2006-11-21 2007-07-02 107.2 2 223 2006-11-27 2.5
118491 9624296 164.2 2006-11-21 2007-02-27 86.0 1 98 2006-11-27 2.5
118492 9624296 164.2 2006-11-21 2008-09-09 131.2 2 658 2006-11-27 2.5
",sep="",header=TRUE)
dat2<-split(dat1,list(dat1$LDL_BASE,dat1$PT_ID))
dat3<-data.frame(do.call(rbind,lapply(dat2,`[`,c(6,8))))
dat4<-data.frame(LDL_Date.PTID=row.names(dat3),dat3)
row.names(dat4)<-1:nrow(dat4)
dat4$LDL_Date.PTID<-gsub("(.*..*)\\.\\d+","\\1",dat4$LDL_Date.PTID)
aggregate(dat4$OBS_VALUE,list(dat4$DAYS_DIFF,dat4$LDL_Date.PTID),mean)
Group.1 Group.2 x
#1 -52 2006-11-20.9624295 208.0
#2 462 2006-11-20.9624295 78.2
#3 505 2006-11-20.9624295 123.8
#4 819 2006-11-20.9624295 157.6
#5 98 2006-11-25.9624295 86.0
#6 223 2006-11-25.9624295 107.2
#7 658 2006-11-25.9624295 131.2
#8 -52 2006-11-26.9624296 208.0
#9 505 2006-11-26.9624296 123.8
#10 819 2006-11-26.9624296 157.6
#11 98 2006-11-27.9624296 86.0
#12 223 2006-11-27.9624296 107.2
#13 462 2006-11-27.9624296 78.2
#14 658 2006-11-27.9624296 131.2
A.K.
________________________________
From: Weijia Wang <wwang.nyu at gmail.com>
To: arun <smartpink111 at yahoo.com>
Sent: Thursday, September 6, 2012 12:21 PM
Subject: Re: [R] R_closest date
Hi, Arun
Do you have idea about good package that split HUGE dataframe by two variables?
I was trying to use 'ddply' to calculate a mean of LDL-C values which had same date for every patient.
Therefore, I need to break down my dataframe, first by patient ID, then by the date of the LDL-C, and finally calculate the mean, if there are multiple LDL-C on a same day.
The example is:
PT_ID BASE IDX_DT OBS_DATE OBS_VALUE CATEGORY DAYS_DIFF LDL_BASE rf
118485 9624295 164.2 2006-11-21 2009-02-17 157.6 2 819 days 2006-11-20 2.5
118486 9624295 164.2 2006-11-21 2006-09-30 208.0 2 -52 days 2006-11-20 2.5
118487 9624295 164.2 2006-11-21 2008-04-09 123.8 2 505 days 2006-11-20 2.5
118488 9624295 164.2 2006-11-21 2008-02-26 17.4 1 462 days 2006-11-20 2.5
118489 9624295 164.2 2006-11-21 2008-02-26 139.0 2 462 days 2006-11-20 2.5
118490 9624295 164.2 2006-11-21 2007-07-02 107.2 2 223 days 2006-11-20 2.5
118491 9624295 164.2 2006-11-21 2007-02-27 86.0 1 98 days 2006-11-20 2.5
118492 9624295 164.2 2006-11-21 2008-09-09 131.2 2 658 days 2006-11-20 2.5
REDUCTION GOAL FAILURE
118485 0.04019488 NOT AT GOAL FAIL
118486 -0.26674787 PRE NOT AT GOAL NOT FAIL
118487 0.24604141 AT GOAL NOT FAIL
118488 0.89403167 AT GOAL NOT FAIL
118489 0.15347138 NOT AT GOAL FAIL
118490 0.34713764 AT GOAL NOT FAIL
118491 0.47624848 AT GOAL NOT FAIL
118492 0.20097442 NOT AT GOAL NOT FAIL
So, this patient has two LDL-C readings on '462 days', therefore, I want to get a mean of these 17.4 and 139.0.
'ddply' did give me a mean when running on a test dataframe, but when I used it on my dataframe with 200,000ish observations, the computer run for like 5 hours and return error. Do you have idea about other good function, that focuses on split and apply function, and rbind?
Best
Weijia
On Mon, Sep 3, 2012 at 2:08 AM, wwang.nyu <wwang.nyu at gmail.com> wrote:
That is actually a great idea, thanks again!
>
>Weijia Wang
>
>
>On Sep 2, 2012, at 12:12 PM, arun <smartpink111 at yahoo.com> wrote:
>
>> Hi,
>> No problem.
>>
>> If you use join() instead of merge(), the original order of columns may not get altered.
>>
>> dat3<-aggregate(DAYS_DIFF~PT_ID,data=dat1,min)
>> library(plyr)
>> join(dat1,dat3,type="inner")
>> #Joining by: PT_ID, DAYS_DIFF
>> # PT_ID IDX_DT OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
>> #1 4549 2002-08-21 2002-08-20 -1 183 2
>> #2 4839 2006-11-28 2006-11-28 0 179 2
>> A.K.
>>
>>
>>
>>
>>
>>
>> ________________________________
>> From: Weijia Wang <wwang.nyu at gmail.com>
>> To: arun <smartpink111 at yahoo.com>
>> Sent: Saturday, September 1, 2012 5:11 PM
>> Subject: Re: [R] R_closest date
>>
>>
>> Thank you Arun, for your help again.
>>
>> Best
>> ______________________________
>> WANG WEIJIA
>> Graudate Research and Teaching Assistant
>> Department of Environmental Medicine
>> New York University, School of Medicine
>> wwang.nyu at gmail.com
>>
>>
>>
>>
>> On Sep 1, 2012, at 5:04 PM, arun <smartpink111 at yahoo.com> wrote:
>>
>> Hi,
>>> Try this:
>>> dat1 <- read.table(text="
>>> PT_ID IDX_DT OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
>>> 13 4549 2002-08-21 2002-08-20 -1 183 2
>>> 14 4549 2002-08-21 2002-11-14 85 91 1
>>> 15 4549 2002-08-21 2003-02-18 181 89 1
>>> 16 4549 2002-08-21 2003-05-15 267 109 2
>>> 17 4549 2002-08-21 2003-12-16 482 96 1
>>> 128 4839 2006-11-28 2006-11-28 0 179 2
>>> ", header=TRUE)
>>> dat3<-aggregate(DAYS_DIFF~PT_ID,data=dat1,min)
>>> merge(dat1,dat3)
>>> # PT_ID DAYS_DIFF IDX_DT OBS_DATE OBS_VALUE CATEGORY
>>> #1 4549 -1 2002-08-21 2002-08-20 183 2
>>> #2 4839 0 2006-11-28 2006-11-28 179 2
>>>
>>> #or,
>>> dat2<- tapply(dat1$DAYS_DIFF,dat1$PT_ID,min)
>>> dat4<-data.frame(PT_ID=row.names(data.frame(dat2)),DAYS_DIFF=dat2)
>>> row.names(dat4)<-1:nrow(dat4)
>>> merge(dat1,dat4)
>>> # PT_ID DAYS_DIFF IDX_DT OBS_DATE OBS_VALUE CATEGORY
>>> #1 4549 -1 2002-08-21 2002-08-20 183 2
>>> #2 4839 0 2006-11-28 2006-11-28 179 2
>>> A.K.
>>>
>>>
>>>
>>>
>>>
>>> ----- Original Message -----
>>> From: WANG WEIJIA <wwang.nyu at gmail.com>
>>> To: "r-help at R-project.org" <r-help at r-project.org>
>>> Cc:
>>> Sent: Saturday, September 1, 2012 1:10 PM
>>> Subject: [R] R_closest date
>>>
>>> Hi,
>>>
>>> I have encountered an issue about finding a date closest to another date
>>>
>>> So this is how the data frame looks like:
>>>
>>> PT_ID IDX_DT OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
>>> 13 4549 2002-08-21 2002-08-20 -1 183 2
>>> 14 4549 2002-08-21 2002-11-14 85 91 1
>>> 15 4549 2002-08-21 2003-02-18 181 89 1
>>> 16 4549 2002-08-21 2003-05-15 267 109 2
>>> 17 4549 2002-08-21 2003-12-16 482 96 1
>>> 128 4839 2006-11-28 2006-11-28 0 179 2
>>>
>>> I need to find, the single observation, which has the closest date of 'OBS_DATE' to 'IDX_DT'.
>>>
>>> For example, for 'PT_ID' of 4549, I need row 13, of which the OBS_DATE is just one day away from IDX_DT.
>>>
>>> I was thinking about using abs(), and I got this:
>>>
>>> baseline<- function(x){
>>> +
>>> + #remove all uncessary variables
>>> + baseline<- x[,c("PT_ID","DAYS_DIFF")]
>>> +
>>> + #get a list of every unique ID
>>> + uniqueID <- unique(baseline$PT_ID)
>>> +
>>> + #make a vector that will contain the smallest DAYS_DIFF
>>> + first <- rep(-99,length(uniqueID))
>>> +
>>> + i = 1
>>> + #loop through each unique ID
>>> + for (PT_ID in uniqueID){
>>> +
>>> + #for each iteration get the smallest DAYS_DIFF for that ID
>>> + first[i] <- min(baseline[which(baseline$PT_ID==PT_ID),abs(baseline$DAYS_DIFF)])
>>> +
>>> + #up the iteration counter
>>> + i = i + 1
>>> +
>>> + }
>>> + #make a data frame with the lowest DAYS_DIFF and ID
>>> + newdata <- data.frame(uniqueID,first)
>>> + names(newdata) <- c("PT_ID","DAYS_DIFF")
>>> +
>>> + #return the data frame containing the lowest GPI for each ID
>>> + return(newdata)
>>> + }
>>>
>>> ldl.b<-baseline(ldl) #get all baseline ldl patient ID, total 11368 obs, all unique#
>>>> Error in `[.data.frame`(baseline, which(baseline$PT_ID == PT_ID), abs(baseline$DAYS_DIFF)) :
>>> undefined columns selected
>>>
>>> Can anyone help me in figuring out how to get the minimum value of the absolute value of DAYS_DIFF for unique ID?
>>>
>>> Thanks a lot
>>> [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>
More information about the R-help
mailing list