[R] the first and last observation for each subject

Fri Jan 2 16:45:47 CET 2009

Here is a fast approach using the Hmisc package's summarize function.

 > library(Hmisc)
 > g <- function(w) {
+  time <- w[,'time']; y <- w[,'y']
+  c(y[which.min(time)], y[which.max(time)])}
 >
 > with(DF, summarize(DF, ID, g, stat.name=c('first','last')))
   ID first last
1  1    20   40
2  2    23   38
3  3    10   15

summarize converts DF to a matrix for speed, as subscripting matrices is 
much faster than subscripting data frames.
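
The original question asked for the difference rather than the two 
values. For comparison, here is a minimal base R sketch that needs no 
extra package. It assumes DF holds exactly the rows listed in the 
question quoted below; the names first, last, and res are only 
illustrative.

```r
# Example data copied from the rows shown in the original question
DF <- data.frame(
  ID   = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
  x    = c(10, 10, 10, 12, 12, 12, 12, 5, 5),
  y    = c(20, 30, 40, 23, 25, 28, 38, 10, 15),
  time = c(0, 1, 2, 0, 1, 2, 3, 0, 2)
)

# Sort by subject and time so that the first/last row of each subject
# really is the earliest/latest observation
DF <- DF[order(DF$ID, DF$time), ]

first <- DF[!duplicated(DF$ID), ]                   # earliest row per ID
last  <- DF[!duplicated(DF$ID, fromLast = TRUE), ]  # latest row per ID

# Both pieces are ordered by ID, so they line up row by row
res <- data.frame(ID = first$ID, x = first$x,
                  first = first$y, last = last$y,
                  diff = last$y - first$y)
res
```

Taking x from the first row assumes x is constant within a subject, as 
stated in the question.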

Frank

hadley wickham wrote:
> On Fri, Jan 2, 2009 at 3:20 AM, gallon li <gallon.li at gmail.com> wrote:
>> I have the following data
>>
>> ID x y time
>> 1  10 20 0
>> 1  10 30 1
>> 1 10 40 2
>> 2 12 23 0
>> 2 12 25 1
>> 2 12 28 2
>> 2 12 38 3
>> 3 5 10 0
>> 3 5 15 2
>> .....
>>
>> x is time invariant, ID is the subject id number, y is changing over time.
>>
>> I want to find out the difference between the first and last observed y
>> value for each subject and get a table like
>>
>> ID x y
>> 1 10 20
>> 2 12 15
>> 3 5 5
>> ......
>>
>> Is there any easy way to generate the data set?
> 
> One approach is to use the plyr package, as documented at
> http://had.co.nz/plyr.  The basic idea is that your problem is easy to
> solve if you have a subset for a single subject value:
> 
> one <- subset(DF, ID == 1)
> with(one, y[length(y)] - y[1])
> 
> The difficulty is splitting up the original dataset into subjects,
> applying the solution to each piece and then joining all the results
> back together.  This is what the plyr package does for you:
> 
> library(plyr)
> 
> # ddply is for splitting up data frames and combining the results
> # into a data frame.  .(ID) says to split up the data frame by the subject
> # variable
> ddply(DF, .(ID), function(one) with(one, y[length(y)] - y[1]))
> 
> # if you want a more informative variable name in the result
> # return a named vector:
> ddply(DF, .(ID), function(one) c(diff = with(one, y[length(y)] - y[1])))
> 
> # plyr takes care of labelling the result for you.
> 
> You don't say why you want to include x, or what to do if x is not
> invariant, but here are a couple of options:
> 
> # Split up by ID and x
> ddply(DF, .(ID, x), function(one) c(diff = with(one, y[length(y)] - y[1])))
> 
> # Return the first x value
> ddply(DF, .(ID), function(one) {
>   with(one, c(
>     x = x[1],
>     diff = y[length(y)] - y[1]
>   ))
> })
> 
> # Throw an error if x is not unique
> 
> ddply(DF, .(ID), function(one) {
>   stopifnot(length(unique(one$x)) == 1)
>   with(one, c(
>     x = x[1],
>     diff = y[length(y)] - y[1]
>   ))
> })
> 
> Regards,
> 
> Hadley
> 
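
The split, apply, and combine steps described in the quoted message can 
also be written out by hand in base R, which may help show what ddply 
is automating. This is only a sketch under the assumption that DF 
contains the example rows from the question; summarise_one, pieces, and 
out are illustrative names and not part of plyr.

```r
# Example data copied from the rows shown in the original question
DF <- data.frame(
  ID   = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
  x    = c(10, 10, 10, 12, 12, 12, 12, 5, 5),
  y    = c(20, 30, 40, 23, 25, 28, 38, 10, 15),
  time = c(0, 1, 2, 0, 1, 2, 3, 0, 2)
)

# Split: one data frame per subject
pieces <- split(DF, DF$ID)

# Apply: reduce one subject's rows to a single summary row
summarise_one <- function(one) {
  one <- one[order(one$time), ]          # make sure rows are in time order
  stopifnot(length(unique(one$x)) == 1)  # x is assumed constant per subject
  data.frame(ID = one$ID[1], x = one$x[1],
             diff = one$y[nrow(one)] - one$y[1])
}
results <- lapply(pieces, summarise_one)

# Combine: stack the one-row results back into a single data frame
out <- do.call(rbind, results)
rownames(out) <- NULL
out
```

Sorting inside summarise_one avoids relying on the input already being 
in time order, which the y[length(y)] - y[1] idiom takes for granted.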

-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University