[R] the first and last observation for each subject

hadley wickham h.wickham at gmail.com
Mon Jan 5 05:55:37 CET 2009


>> library(plyr)
>>
>> # ddply is for splitting up data frames and combining the results
>> # into a data frame.  .(ID) says to split up the data frame by the
>> # subject variable
>> ddply(DF, .(ID), function(one) with(one, y[length(y)] - y[1]))
>> ...
>
> The above is much quicker than the versions based on aggregate and

plyr does make some optimisations to increase speed and decrease
memory usage (mainly by passing around lists of indices, rather than
lists of the original objects) but it's unlikely ever to approach the
speed of a pure vector approach (although I hope to put some time into
rewriting the slow parts in C to do better with performance).

> easy to understand.  Another approach is more specialized but useful
> when you have lots of ID's (e.g., millions) and speed is very important.
> It computes where the first and last entry for each ID in a vectorized
> computation, akin to the computation that rle() uses:

I particularly like this solution to the problem - it's a very handy
technique, and while it takes a while to get your head around how it
works, it's worth spending the time to do so because it crops up
as a useful solution to many similar types of problems. (It can be
particularly useful in Excel too, as a quick way of locating
boundaries between groups.)
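
For anyone who hasn't seen the idea before, here is a minimal sketch of
the boundary technique. It is not necessarily the code from the quoted
message: the data and the names is_first, is_last and change are made up
for illustration, and it assumes DF has at least one row and is already
sorted so that each ID's rows are contiguous.

```r
# Hypothetical example data: rows sorted so each ID forms one contiguous run.
DF <- data.frame(ID = c(1, 1, 1, 2, 2, 3),
                 y  = c(10, 12, 15, 7, 4, 9))

n <- nrow(DF)

# A row starts a group if its ID differs from the previous row's ID;
# the very first row always starts a group.
is_first <- c(TRUE, DF$ID[-1] != DF$ID[-n])

# A row ends a group if its ID differs from the next row's ID;
# the very last row always ends a group.
is_last <- c(DF$ID[-n] != DF$ID[-1], TRUE)

# Last observation minus first observation, one value per subject.
result <- data.frame(ID     = DF$ID[is_first],
                     change = DF$y[is_last] - DF$y[is_first])
```

Everything here is a comparison of a vector with a shifted copy of
itself, so no per-group function call is needed, which is where the speed
comes from when there are very many IDs. If the data are not sorted by
ID, order them first (e.g. DF <- DF[order(DF$ID), ]).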

Hadley

-- 
http://had.co.nz/



