[R] long format - find age when another variable is first 'high'

Mon May 25 15:52:15 CEST 2009

On May 25, 2009, at 7:45 AM, David Freedman wrote:

>
> Dear R,
>
> I've got a data frame with children examined multiple times and at various
> ages.  I'm trying to find the first age at which another variable
> (LDL-Cholesterol) is >= 130 mg/dL; for some children, this may never happen.
> I can do this with transformBy and ddply, but with 10,000 different
> children, these functions take some time on my PCs - is there a faster way
> to do this in R?  My code on a small dataset follows.
>
> Thanks very much, David Freedman
>
> d<-data.frame(id=c(rep(1,3),rep(2,2),3),
>               age=c(5,10,15,4,7,12),ldlc=c(132,120,125,105,142,160))
> d$high.ldlc<-ifelse(d$ldlc>=130,1,0)
> d
> library(plyr)
> d2<-ddply(d,~id,transform,plyr.minage=min(age[high.ldlc==1]));
> library(doBy)
> d2<-transformBy(~id,da=d2,doby.minage=min(age[high.ldlc==1]));
> d2

The first thing that I would do is to get rid of records that are not  
relevant to your question:

 > d
  id age ldlc high.ldlc
1  1   5  132         1
2  1  10  120         0
3  1  15  125         0
4  2   4  105         0
5  2   7  142         1
6  3  12  160         1

# Get records with high ldl
d.new <- subset(d, ldlc >= 130)

 > d.new
  id age ldlc high.ldlc
1  1   5  132         1
5  2   7  142         1
6  3  12  160         1

That will help to reduce the total size of the dataset, perhaps
substantially. It will also remove entire subjects that are not
relevant (e.g., those who never have LDL >= 130).

Then get the minimum age for each of the remaining subjects:

 > aggregate(d.new$age, list(id = d.new$id), min)
  id  x
1  1  5
2  2  7
3  3 12
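
If the minimum age is wanted on every row of the original data frame, as
in the ddply/transformBy versions, the aggregated result could be merged
back. This is only a sketch: the column name first.high.age is made up
for illustration, and it assumes that children who never reach 130
should get NA rather than be dropped.

```r
# Toy data from the original post
d <- data.frame(id   = c(rep(1, 3), rep(2, 2), 3),
                age  = c(5, 10, 15, 4, 7, 12),
                ldlc = c(132, 120, 125, 105, 142, 160))

# Keep only the high-LDL records, then take the minimum age per child
d.new <- subset(d, ldlc >= 130)
first.high <- aggregate(d.new$age, list(id = d.new$id), min)
names(first.high)[2] <- "first.high.age"   # hypothetical column name

# Left join back onto the full data; ids absent from first.high get NA
d2 <- merge(d, first.high, by = "id", all.x = TRUE)
d2
```

The all.x = TRUE argument is what keeps children with no high value in
the result; without it merge() would drop them.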

Try that to see what sort of time reduction you observe.
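
One way to check the timing might be to wrap the two steps in
system.time() on a simulated data set of roughly the size described.
Everything about the simulated data here (5 visits per child, the ages,
the LDL distribution, the seed) is an assumption made up for the test,
not taken from the real data.

```r
set.seed(1)                        # arbitrary seed, for reproducibility
n.id <- 10000                      # number of children, as in the question

# Simulated long-format data: 5 visits per child (assumed)
big <- data.frame(id   = rep(seq_len(n.id), each = 5),
                  age  = rep(c(5, 8, 11, 14, 17), times = n.id),
                  ldlc = round(rnorm(5 * n.id, mean = 110, sd = 25)))

# Time the subset + aggregate approach
timing <- system.time({
  big.new <- subset(big, ldlc >= 130)
  res <- aggregate(big.new$age, list(id = big.new$id), min)
})
timing["elapsed"]
```

The same system.time() wrapper can be put around the ddply or
transformBy call to compare on the same data.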

HTH,

Marc Schwartz