[R] the first and last observation for each subject

William Dunlap wdunlap at tibco.com
Mon Jan 5 05:24:29 CET 2009


> [R] the first and last observation for each subject
> hadley wickham h.wickham at gmail.com
> Fri Jan 2 14:52:42 CET 2009
> 
> On Fri, Jan 2, 2009 at 3:20 AM, gallon li <gallon.li at gmail.com> wrote:
> > I have the following data
> >
> > ID x y time
> > 1 10 20 0
> > 1 10 30 1
> > 1 10 40 2
> > 2 12 23 0
> > 2 12 25 1
> > 2 12 28 2
> > 2 12 38 3
> > 3 5 10 0
> > 3 5 15 2
> > .....
> >
> > x is time invariant, ID is the subject id number, y is changing over
> > time.
> >
> > I want to find out the difference between the first and last
> > observed y value for each subject and get a table like
> >
> > ID x y
> > 1 10 20
> > 2 12 15
> > 3 5 5
> > ......
> >
> > Is there any easy way to generate the data set?
> 
> One approach is to use the plyr package, as documented at
> http://had.co.nz/plyr.  The basic idea is that your problem is easy to
> solve if you have a subset for a single subject value:
> 
> one <- subset(DF, ID == 1)
> with(one, y[length(y)] - y[1])
> 
> The difficulty is splitting up the original dataset into subjects,
> applying the solution to each piece and then joining all the results
> back together.  This is what the plyr package does for you:
> 
> library(plyr)
> 
> # ddply is for splitting up data frames and combining the results
> # into a data frame.  .(ID) says to split up the data frame by the
> # subject variable
> ddply(DF, .(ID), function(one) with(one, y[length(y)] - y[1]))
> ...

The above is much quicker than the versions based on aggregate() and is
easy to understand.  Another approach is more specialized but useful
when you have lots of IDs (e.g., millions) and speed is very important.
It finds the first and last entry for each ID with a vectorized
computation, akin to the one that rle() uses.  It assumes that the rows
for each ID are contiguous and already in time order:

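f0 <- function(DF) {
   # Assumes rows for each ID are contiguous and in time order.
   # TRUE wherever the ID differs from the ID in the previous row
   changes <- DF$ID[-1] != DF$ID[-length(DF$ID)]
   first <- c(TRUE, changes)   # first row of each ID's block
   last <- c(changes, TRUE)    # last row of each ID's block
   ydiff <- DF$y[last] - DF$y[first]
   # Keep one row per ID and replace y with last-minus-first
   DF <- DF[first,]
   DF$y <- ydiff
   DF
}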
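
As a rough illustration of what the logical vectors look like, here is
the same computation traced by hand on the nine rows shown in the
original question.  The vectors ID and y below are just those sample
values typed in; the order() call is an added precaution, assuming the
data might not arrive sorted by ID and time.

```r
# Sample values from the question (subjects 1 to 3 only)
ID   <- c(1, 1, 1, 2, 2, 2, 2, 3, 3)
y    <- c(20, 30, 40, 23, 25, 28, 38, 10, 15)
time <- c(0, 1, 2, 0, 1, 2, 3, 0, 2)

# The approach relies on each ID's rows being contiguous and in time
# order, so sort first if that is not guaranteed (a no-op here).
o <- order(ID, time)
ID <- ID[o]
y <- y[o]

# TRUE wherever the ID differs from the ID in the previous row
changes <- ID[-1] != ID[-length(ID)]
first <- c(TRUE, changes)   # marks the first row of each ID
last  <- c(changes, TRUE)   # marks the last row of each ID

which(first)                # 1 4 8
which(last)                 # 3 7 9
ydiff <- y[last] - y[first]
ydiff                       # 20 15 5
```

The differences 20, 15 and 5 match the y column of the table requested
in the question.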


Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com 



