[R] How to speed up or avoid the for-loops in this example?

Thu Feb 15 03:48:50 CET 2007

On Thu, 2007-02-15 at 12:24 +1100, Tim Churches wrote:
> Any advice, tips, clues or pointers to resources on how best to speed up
> or, better still, avoid the loops in the following example code much
> appreciated. My actual dataset has several tens of thousands of rows and
> lots of columns, and these loops take a rather long time to run.
> Everything else which I need to do is done using vectors and those parts
> all run very quickly indeed. I spent quite a while doing searches on
> r-help and re-reading the various manuals, but couldn't find any
> existing relevant advice. I am sure the solution is obvious, but it
> escapes me.
> 
> Tim C
> 
> # create an example data frame, multiple events per subject
> 
> year <- c(1980,1982,1996,1985,1987,1990,1991,1992,1999,1972,1983)
> event.of.interest <- c(F,T,T,F,F,F,T,F,T,T,F)
> subject <- c(1,1,1,2,2,3,3,3,3,4,4)
> df <- data.frame(cbind(subject,year,event.of.interest))
> 
> # add a per-subject sequence number
> 
> df$subject.seq <- 1
> for (i in 2:nrow(df)) {
>  if (df$subject[i-1] == df$subject[i]) df$subject.seq[i] <-
> df$subject.seq[i-1] + 1
> }
> df
> 
> # add an event sequence number which is zero until the first
> # event of interest for that subject happens, and then increments
> # thereafter
> 
> df$event.seq <- 0
> for (i in 1:nrow(df)) {
>  if (df$subject.seq[i] == 1 ) {
>     current.event.seq <- 0
>  }
>  if (event.of.interest[i] == 1 | current.event.seq > 0)
> current.event.seq <- current.event.seq + 1
>  df$event.seq[i] <- current.event.seq
> }
> df

OK, here is one possible solution, though with a bit more time there
may well be better approaches.

Using your example data above, first note that you do not want to
use:

  df <- data.frame(cbind(subject,year,event.of.interest))

Calling cbind() first creates a matrix, which coerces all columns to a
common data type and so defeats the ability of data frames to hold
columns of different types. For example:

> str(df)
'data.frame':	11 obs. of  3 variables:
 $ subject          : num  1 1 1 2 2 3 3 3 3 4 ...
 $ year             : num  1980 1982 1996 1985 1987 ...
 $ event.of.interest: num  0 1 1 0 0 0 1 0 1 1 ...

Note that your column "event.of.interest" is coerced to numeric,
rather than staying logical.

Thus, use:

df <- data.frame(subject, year, event.of.interest)

> str(df)
'data.frame':	11 obs. of  3 variables:
 $ subject          : num  1 1 1 2 2 3 3 3 3 4 ...
 $ year             : num  1980 1982 1996 1985 1987 ...
 $ event.of.interest: logi  FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE
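
As a quick check of the coercion point, here is a small sketch that
builds the data frame both ways from the example vectors and compares
the column classes. The names 'via.cbind' and 'direct' are only for
illustration and are not used elsewhere in this post.

```r
year <- c(1980,1982,1996,1985,1987,1990,1991,1992,1999,1972,1983)
event.of.interest <- c(FALSE,TRUE,TRUE,FALSE,FALSE,FALSE,TRUE,FALSE,TRUE,TRUE,FALSE)
subject <- c(1,1,1,2,2,3,3,3,3,4,4)

# cbind() builds a numeric matrix first, so the logical column is lost
via.cbind <- data.frame(cbind(subject, year, event.of.interest))

# passing the vectors directly keeps each column's own type
direct <- data.frame(subject, year, event.of.interest)

sapply(via.cbind, class)
sapply(direct, class)
```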

So, now on to the solution:

# First, order the data frame by increasing subject number and,
# within each subject, by decreasing event.of.interest, so that
# the TRUE rows come first. Note that this changes the row order
# within a subject (the rows are no longer in year order).

df <- df[order(df$subject, -df$event.of.interest), ]

So, 'df' will look like:

> df
   subject year event.of.interest
2        1 1982              TRUE
3        1 1996              TRUE
1        1 1980             FALSE
4        2 1985             FALSE
5        2 1987             FALSE
7        3 1991              TRUE
9        3 1999              TRUE
6        3 1990             FALSE
8        3 1992             FALSE
10       4 1972              TRUE
11       4 1983             FALSE

# Now use a combination of lapply(), rle(), seq_len() and unlist() to
# generate per subject sequences. lapply() is used rather than
# sapply() because sapply() would return a matrix if every subject
# had the same number of rows. Note that rle() returns:
#
# > rle(df$subject)
# Run Length Encoding
#   lengths: int [1:4] 3 2 4 2
#   values : num [1:4] 1 2 3 4
#
# See ?rle, ?seq_len, ?lapply and ?unlist

df$subject.seq <- unlist(lapply(rle(df$subject)$lengths, seq_len))

So, 'df' now looks like:

> df
   subject year event.of.interest subject.seq
2        1 1982              TRUE           1
3        1 1996              TRUE           2
1        1 1980             FALSE           3
4        2 1985             FALSE           1
5        2 1987             FALSE           2
7        3 1991              TRUE           1
9        3 1999              TRUE           2
6        3 1990             FALSE           3
8        3 1992             FALSE           4
10       4 1972              TRUE           1
11       4 1983             FALSE           2
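
The same per-subject counter can also be built in a single call. The
sketch below shows two options on the bare 'subject' vector:
sequence(), which relies on each subject's rows being contiguous (as
they are here), and ave(), which does not. The names 'subject.seq' and
'subject.seq2' are just local variables for the comparison.

```r
subject <- c(1,1,1,2,2,3,3,3,3,4,4)

# sequence(c(3, 2, 4, 2)) gives 1 2 3 1 2 1 2 3 4 1 2
subject.seq <- sequence(rle(subject)$lengths)

# ave() applies seq_along() within each subject, contiguous or not
subject.seq2 <- ave(seq_along(subject), subject, FUN = seq_along)

all(subject.seq == subject.seq2)
```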

# Now set event.seq to all 0's

df$event.seq <- 0

So, 'df' now looks like:

> df
   subject year event.of.interest subject.seq event.seq
2        1 1982              TRUE           1         0
3        1 1996              TRUE           2         0
1        1 1980             FALSE           3         0
4        2 1985             FALSE           1         0
5        2 1987             FALSE           2         0
7        3 1991              TRUE           1         0
9        3 1999              TRUE           2         0
6        3 1990             FALSE           3         0
8        3 1992             FALSE           4         0
10       4 1972              TRUE           1         0
11       4 1983             FALSE           2         0

# Get the unique subject id's
# See ?unique

subj.id <- unique(df$subject)

# Now get the indices for each subject where event.of.interest
# is TRUE.  See ?which and ?lapply

events <- lapply(subj.id, 
                 function(x) which(df$subject == x & df$event.of.interest))

So, 'events' looks like:

> events
[[1]]
[1] 1 2

[[2]]
integer(0)

[[3]]
[1] 6 7

[[4]]
[1] 10
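
One thing to watch at this step: sapply() simplifies its result when
every element has the same length, so a list like the one above is
only guaranteed with lapply(). A small sketch with made-up data (the
vectors 'subj' and 'ev' are hypothetical, not part of the example):

```r
# hypothetical data: two subjects, exactly one event each
subj <- c(1, 1, 2, 2)
ev   <- c(TRUE, FALSE, FALSE, TRUE)
ids  <- unique(subj)

s <- sapply(ids, function(x) which(subj == x & ev))  # simplified to a vector
l <- lapply(ids, function(x) which(subj == x & ev))  # always a list

is.list(s)
is.list(l)
```

With the example data the subjects have different numbers of events,
so both calls happen to return a list there.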

# Now use lapply() on the above list to create
# individual sequences per list element:

seq <- lapply(events, seq_along)

So 'seq' looks like:

> seq
[[1]]
[1] 1 2

[[2]]
integer(0)

[[3]]
[1] 1 2

[[4]]
[1] 1

# So, for the final step, assign the event sequence values in 'seq' to
# the row indices in 'events':

df$event.seq[unlist(events)] <- unlist(seq)

So, 'df' now looks like this:

> df
   subject year event.of.interest subject.seq event.seq
2        1 1982              TRUE           1         1
3        1 1996              TRUE           2         2
1        1 1980             FALSE           3         0
4        2 1985             FALSE           1         0
5        2 1987             FALSE           2         0
7        3 1991              TRUE           1         1
9        3 1999              TRUE           2         2
6        3 1990             FALSE           3         0
8        3 1992             FALSE           4         0
10       4 1972              TRUE           1         1
11       4 1983             FALSE           2         0
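
Note that the result above numbers only the TRUE rows, after sorting,
whereas the loops in the original post keep the rows in their original
order and keep incrementing on every row once a subject has had its
first event. If that behaviour is what is wanted, the following is a
sketch (not the approach used above) based on ave() and cumsum(). It
assumes the rows are already in the desired order within each subject,
as in the example data.

```r
year <- c(1980,1982,1996,1985,1987,1990,1991,1992,1999,1972,1983)
event.of.interest <- c(FALSE,TRUE,TRUE,FALSE,FALSE,FALSE,TRUE,FALSE,TRUE,TRUE,FALSE)
subject <- c(1,1,1,2,2,3,3,3,3,4,4)
df <- data.frame(subject, year, event.of.interest)

# per-subject row counter, no sorting needed
df$subject.seq <- ave(seq_along(df$subject), df$subject, FUN = seq_along)

# inner cumsum(e) > 0 is TRUE from a subject's first event onwards;
# outer cumsum() then counts 1, 2, 3, ... over those rows
df$event.seq <- ave(as.integer(df$event.of.interest), df$subject,
                    FUN = function(e) cumsum(cumsum(e) > 0))

df$event.seq
# [1] 0 1 2 0 0 0 1 2 3 1 2
```

Both calls work on whole columns, so there is no explicit loop over
rows; the only iteration is over subjects inside ave().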

HTH,

Marc Schwartz