[R] the first and last observation for each subject
Frank E Harrell Jr
f.harrell at vanderbilt.edu
Fri Jan 2 16:45:47 CET 2009
Here is a fast approach using the Hmisc package's summarize function.
> g <- function(w) {
+ time <- w[,'time']; y <- w[,'y']
+ c(y[which.min(time)], y[which.max(time)])}
>
> with(DF, summarize(DF, ID, g, stat.name=c('first','last')))
  ID first last
1  1    20   40
2  2    23   38
3  3    10   15
summarize converts DF to a matrix for speed, as subscripting matrices is
much faster than subscripting data frames.
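
To go from the first and last values to the difference asked for in the original question, here is a minimal base R sketch. It does not use Hmisc. It rebuilds the example data from the quoted message as DF (the rows hidden behind "....." are left out) and assumes each subject has one row per time point; the names s, first, last and res are only illustrative.

```r
# Example data copied from the quoted question (elided rows omitted)
DF <- data.frame(
  ID   = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
  x    = c(10, 10, 10, 12, 12, 12, 12, 5, 5),
  y    = c(20, 30, 40, 23, 25, 28, 38, 10, 15),
  time = c(0, 1, 2, 0, 1, 2, 3, 0, 2)
)

# Sort by subject, then by time, so each subject's rows run from
# earliest to latest observation
s <- DF[order(DF$ID, DF$time), ]

# First row per subject: the first occurrence of each ID
first <- s[!duplicated(s$ID), ]

# Last row per subject: the first occurrence when scanning from the end
last <- s[!duplicated(s$ID, fromLast = TRUE), ]

# Both subsets are ordered by ID, so the rows line up subject by subject
res <- data.frame(ID = first$ID, x = first$x, diff = last$y - first$y)
res
```

Sorting once and using duplicated() avoids calling a function per subject, which may matter when there are many subjects.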
Frank
hadley wickham wrote:
> On Fri, Jan 2, 2009 at 3:20 AM, gallon li <gallon.li at gmail.com> wrote:
>> I have the following data
>>
>> ID  x  y time
>>  1 10 20    0
>>  1 10 30    1
>>  1 10 40    2
>>  2 12 23    0
>>  2 12 25    1
>>  2 12 28    2
>>  2 12 38    3
>>  3  5 10    0
>>  3  5 15    2
>> .....
>>
>> x is time invariant, ID is the subject id number, y is changing over time.
>>
>> I want to find out the difference between the first and last observed y
>> value for each subject and get a table like
>>
>> ID  x  y
>>  1 10 20
>>  2 12 15
>>  3  5  5
>> ......
>>
>> Is there any easy way to generate the data set?
>
> One approach is to use the plyr package, as documented at
> http://had.co.nz/plyr. The basic idea is that your problem is easy to
> solve if you have a subset for a single subject value:
>
> one <- subset(DF, ID == 1)
> with(one, y[length(y)] - y[1])
>
> The difficulty is splitting up the original dataset into subjects,
> applying the solution to each piece and then joining all the results
> back together. This is what the plyr package does for you:
>
> library(plyr)
>
> # ddply is for splitting up data frames and combining the results
> # into a data frame. .(ID) says to split up the data frame by the subject
> # variable
> ddply(DF, .(ID), function(one) with(one, y[length(y)] - y[1]))
>
> # if you want a more informative variable name in the result
> # return a named vector:
> ddply(DF, .(ID), function(one) c(diff = with(one, y[length(y)] - y[1])))
>
> # plyr takes care of labelling the result for you.
>
> You don't say why you want to include x, or what to do if x is not
> invariant, but here are a couple of options:
>
> # Split up by ID and x
> ddply(DF, .(ID, x), function(one) c(diff = with(one, y[length(y)] - y[1])))
>
> # Return the first x value
> ddply(DF, .(ID), function(one) {
>   with(one, c(
>     x = x[1],
>     diff = y[length(y)] - y[1]
>   ))
> })
>
> # Throw an error if x is not unique
>
> ddply(DF, .(ID), function(one) {
>   stopifnot(length(unique(one$x)) == 1)
>   with(one, c(
>     x = x[1],
>     diff = y[length(y)] - y[1]
>   ))
> })
>
> Regards,
>
> Hadley
>
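
The split, apply and combine steps described in the quoted message can also be written out by hand in base R. This is only a sketch of the idea, not how plyr is implemented. The function name diff_one and the example data frame are assumptions for illustration, and, like the quoted code, it relies on the rows of each subject already being sorted by time.

```r
# Example data copied from the quoted question (elided rows omitted)
DF <- data.frame(
  ID   = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
  x    = c(10, 10, 10, 12, 12, 12, 12, 5, 5),
  y    = c(20, 30, 40, 23, 25, 28, 38, 10, 15),
  time = c(0, 1, 2, 0, 1, 2, 3, 0, 2)
)

# Hypothetical per-subject function: receives the rows for one ID
diff_one <- function(one) {
  # x is expected to be time invariant within a subject
  stopifnot(length(unique(one$x)) == 1)
  data.frame(
    ID   = one$ID[1],
    x    = one$x[1],
    diff = one$y[nrow(one)] - one$y[1]  # last y minus first y
  )
}

pieces  <- split(DF, DF$ID)          # split: one data frame per subject
results <- lapply(pieces, diff_one)  # apply: one-row result per subject
out     <- do.call(rbind, results)   # combine: stack the rows
rownames(out) <- NULL
out
```

Here split() plays the role of .(ID), lapply() applies the per-subject function, and do.call(rbind, ...) joins the pieces back into a single data frame.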
--
Frank E Harrell Jr   Professor and Chair           School of Medicine
                     Department of Biostatistics   Vanderbilt University