[R] How to speed up or avoid the for-loops in this example?
Tim Churches
tchur at optushome.com.au
Thu Feb 15 04:25:07 CET 2007
jim holtman wrote:
> On 2/14/07, Tim Churches <tchur at optushome.com.au> wrote:
>> Any advice, tips, clues or pointers to resources on how best to speed up
>> or, better still, avoid the loops in the following example code much
>> appreciated. My actual dataset has several tens of thousands of rows and
>> lots of columns, and these loops take a rather long time to run.
>> Everything else which I need to do is done using vectors and those parts
>> all run very quickly indeed. I spent quite a while doing searches on
>> r-help and re-reading the various manuals, but couldn't find any
>> existing relevant advice. I am sure the solution is obvious, but it
>> escapes me.
>>
>> Tim C
>>
>> # create an example data frame, multiple events per subject
>>
>> year <- c(1980,1982,1996,1985,1987,1990,1991,1992,1999,1972,1983)
>> event.of.interest <- c(F,T,T,F,F,F,T,F,T,T,F)
>> subject <- c(1,1,1,2,2,3,3,3,3,4,4)
>> df <- data.frame(cbind(subject,year,event.of.interest))
>>
>> # add a per-subject sequence number
>>
>> df$subject.seq <- 1
>> for (i in 2:nrow(df)) {
>> if (df$subject[i-1] == df$subject[i]) df$subject.seq[i] <-
>> df$subject.seq[i-1] + 1
>> }
>> df
>
> # add an event sequence number which is zero until the first
>> # event of interest for that subject happens, and then increments
>> # thereafter
>>
>> df$event.seq <- 0
>> for (i in 1:nrow(df)) {
>> if (df$subject.seq[i] == 1 ) {
>> current.event.seq <- 0
>> }
>> if (event.of.interest[i] == 1 | current.event.seq > 0)
>> current.event.seq <- current.event.seq + 1
>> df$event.seq[i] <- current.event.seq
>> }
>> df
>
>
>
> try:
>
>> df <- data.frame(cbind(subject,year,event.of.interest))
>> df <- do.call(rbind,by(df, df$subject, function(z){z$subject.seq <-
> seq(nrow(z)); z}))
>> df
> subject year event.of.interest subject.seq
> 1.1 1 1980 0 1
> 1.2 1 1982 1 2
> 1.3 1 1996 1 3
> 2.4 2 1985 0 1
> 2.5 2 1987 0 2
> 3.6 3 1990 0 1
> 3.7 3 1991 1 2
> 3.8 3 1992 0 3
> 3.9 3 1999 1 4
> 4.10 4 1972 1 1
> 4.11 4 1983 0 2
>> # determine first event
>> df <- do.call(rbind, by(df, df$subject, function(x){
> + # determine first event
> + .first <- cumsum(x$event.of.interest)
> + # create sequence after first non-zero
> + .first <- cumsum(.first > 0)
> + x$event.seq <- .first
> + x
> + }))
>> df
> subject year event.of.interest subject.seq event.seq
> 1.1.1 1 1980 0 1 0
> 1.1.2 1 1982 1 2 1
> 1.1.3 1 1996 1 3 2
> 2.2.4 2 1985 0 1 0
> 2.2.5 2 1987 0 2 0
> 3.3.6 3 1990 0 1 0
> 3.3.7 3 1991 1 2 1
> 3.3.8 3 1992 0 3 2
> 3.3.9 3 1999 1 4 3
> 4.4.10 4 1972 1 1 1
> 4.4.11 4 1983 0 2 2
Thanks Jim, that works a treat, over an order of magnitude faster than
the for-loops.
Anders Nielsen also provided this solution:
df$subject.seq<-unlist(tapply(df$subject,
df$subject,
function(x)1:length(x)
)
)
Doing it that way is about 5 times faster than using rbind(). But Jim's
use of cumsum on the logical vector is very nifty.
I have now combined Jim's function with Anders' column-oriented approach
and the result is that my code now runs about two orders of magnitude
faster.
Many thanks,
Tim C
More information about the R-help
mailing list