[R] How to speed up or avoid the for-loops in this example?

Thu Feb 15 04:29:27 CET 2007

Marc Schwartz wrote:
> OK, here is one possible solution, though with a bit more time
> there may be better approaches.
> 
> Using your example data above, but first noting that you do not want to
> use:
> 
>   df <- data.frame(cbind(subject,year,event.of.interest))
> 
> Using cbind() first creates a matrix, which coerces all columns to a
> common data type. That defeats the benefit of data frames, which can
> hold a different data type in each column.

Yes, quite right, the cbind() was unnecessary. I'm not making my real
data frame that way, however.
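
To make the coercion point concrete, here is a minimal sketch. The three vectors are reconstructed from the row names and values in the printed data frame in this thread, so treat them as an assumption about the original example data; the names via.matrix and direct are made up for illustration.

```r
# Example data reconstructed from the printed data frame (an assumption)
subject <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4)
year <- c(1980, 1982, 1996, 1985, 1987, 1990, 1991, 1992, 1999, 1972, 1983)
event.of.interest <- c(FALSE, TRUE, TRUE, FALSE, FALSE, FALSE,
                       TRUE, FALSE, TRUE, TRUE, FALSE)

# cbind() builds a numeric matrix first, so the logical column becomes 0/1
via.matrix <- data.frame(cbind(subject, year, event.of.interest))

# Passing the vectors directly keeps each column's own type
direct <- data.frame(subject, year, event.of.interest)

# Compare the column classes of the two data frames
print(sapply(via.matrix, class))
print(sapply(direct, class))
```

With a character column in the mix, cbind() would turn every column into character, which is usually a bigger problem than the logical-to-numeric change shown here.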

> So, now on to the solution:
> 
> # First, order the data frame by increasing order of
> # subject number and decreasing order for event.of.interest
> # This ensures that these columns are properly sorted
> # to facilitate the subsequent code. 
> 
> df <- df[order(df$subject, -df$event.of.interest), ]
> 
> 
> So, 'df' will look like:
> 
>> df
>    subject year event.of.interest
> 2        1 1982              TRUE
> 3        1 1996              TRUE
> 1        1 1980             FALSE
> 4        2 1985             FALSE
> 5        2 1987             FALSE
> 7        3 1991              TRUE
> 9        3 1999              TRUE
> 6        3 1990             FALSE
> 8        3 1992             FALSE
> 10       4 1972              TRUE
> 11       4 1983             FALSE
> 
> 
> # Now use a combination of sapply(), rle(), seq() and unlist() to
> # generate per subject sequences. Note that rle() returns:
> #
> # > rle(df$subject)
> # Run Length Encoding
> #   lengths: int [1:4] 3 2 4 2
> #   values : num [1:4] 1 2 3 4
> #
> # See ?rle, ?seq, ?sapply and ?unlist
> 
> df$subject.seq <- unlist(sapply(rle(df$subject)$lengths, 
>                                 function(x) seq(x)))
> 
> 
> So, 'df' now looks like:
> 
>> df
>    subject year event.of.interest subject.seq
> 2        1 1982              TRUE           1
> 3        1 1996              TRUE           2
> 1        1 1980             FALSE           3
> 4        2 1985             FALSE           1
> 5        2 1987             FALSE           2
> 7        3 1991              TRUE           1
> 9        3 1999              TRUE           2
> 6        3 1990             FALSE           3
> 8        3 1992             FALSE           4
> 10       4 1972              TRUE           1
> 11       4 1983             FALSE           2
> 
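
A possible simplification of the step above, offered as a sketch rather than as part of Marc's solution: base R's sequence() builds the same per-subject counters directly from the run lengths. It also always returns a plain vector, whereas sapply() simplifies its result to a matrix when every subject happens to have the same number of rows. The subject vector below is copied from the sorted 'df' shown above; run.lengths and subject.seq are illustrative names.

```r
# Subject ids in sorted order, copied from the printed 'df'
subject <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4)

# Length of each run of identical subject ids: 3 2 4 2
run.lengths <- rle(subject)$lengths

# Concatenate 1..n for each run length n
subject.seq <- sequence(run.lengths)
print(subject.seq)
```

This relies, like the original, on the data frame already being sorted by subject, since rle() only sees consecutive runs.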
> 
> # Now set event.seq to all 0's
> 
> df$event.seq <- 0
> 
> 
> So, 'df' now looks like:
> 
>> df
>    subject year event.of.interest subject.seq event.seq
> 2        1 1982              TRUE           1         0
> 3        1 1996              TRUE           2         0
> 1        1 1980             FALSE           3         0
> 4        2 1985             FALSE           1         0
> 5        2 1987             FALSE           2         0
> 7        3 1991              TRUE           1         0
> 9        3 1999              TRUE           2         0
> 6        3 1990             FALSE           3         0
> 8        3 1992             FALSE           4         0
> 10       4 1972              TRUE           1         0
> 11       4 1983             FALSE           2         0
> 
> 
> # Get the unique subject id's
> # See ?unique
> 
> subj.id <- unique(df$subject)
> 
> 
> # Now get the indices for each subject where event.of.interest
> # is TRUE.  See ?which
> 
> events <- sapply(subj.id, 
>                  function(x) which(df$subject == x & df$event.of.interest))
> 
> 
> So, 'events' looks like:
> 
>> events
> [[1]]
> [1] 1 2
> 
> [[2]]
> integer(0)
> 
> [[3]]
> [1] 6 7
> 
> [[4]]
> [1] 10
> 
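
One caveat worth sketching here: sapply() returns a list in this example only because the subjects have different numbers of events. Assuming you always want a list, lapply() gives one regardless. The two vectors are copied from the sorted 'df' above, and n.events is an illustrative name.

```r
# Columns in sorted order, copied from the printed 'df'
subject <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4)
event.of.interest <- c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE,
                       TRUE, FALSE, FALSE, TRUE, FALSE)

subj.id <- unique(subject)

# lapply() always returns a list, one element per subject
events <- lapply(subj.id,
                 function(x) which(subject == x & event.of.interest))

# Number of events per subject
n.events <- lengths(events)
print(n.events)
```

Subjects with no events keep an empty integer(0) element, so the list stays aligned with subj.id.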
> 
> # Now use sapply() on the above list to create
> # individual sequences per list element:
> 
> seq <- sapply(events, function(x) seq(along = x))
> 
> 
> So 'seq' looks like:
> 
>> seq
> [[1]]
> [1] 1 2
> 
> [[2]]
> integer(0)
> 
> [[3]]
> [1] 1 2
> 
> [[4]]
> [1] 1
> 
> 
> # So, for the final step, assign the event sequence values in 'seq' to
> # the row indices in 'events':
> 
> df$event.seq[unlist(events)] <- unlist(seq)
> 
> 
> So, 'df' now looks like this:
> 
>> df
>    subject year event.of.interest subject.seq event.seq
> 2        1 1982              TRUE           1         1
> 3        1 1996              TRUE           2         2
> 1        1 1980             FALSE           3         0
> 4        2 1985             FALSE           1         0
> 5        2 1987             FALSE           2         0
> 7        3 1991              TRUE           1         1
> 9        3 1999              TRUE           2         2
> 6        3 1990             FALSE           3         0
> 8        3 1992             FALSE           4         0
> 10       4 1972              TRUE           1         1
> 11       4 1983             FALSE           2         0
> 
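
For comparison, the same two columns can probably be produced more compactly with ave(), which applies a function within groups. This is an independent sketch, not the Holtzman/Nielsen approach (which is not reproduced in this message), and the data are again reconstructed from the printed data frame.

```r
# Example data reconstructed from the printed data frame (an assumption)
df <- data.frame(
  subject = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4),
  year = c(1980, 1982, 1996, 1985, 1987, 1990, 1991, 1992, 1999, 1972, 1983),
  event.of.interest = c(FALSE, TRUE, TRUE, FALSE, FALSE, FALSE,
                        TRUE, FALSE, TRUE, TRUE, FALSE)
)

# Same ordering as in the original solution
df <- df[order(df$subject, -df$event.of.interest), ]

# Row counter within each subject
df$subject.seq <- ave(seq_along(df$subject), df$subject, FUN = seq_along)

# Running count of events within each subject, zeroed on non-event rows
df$event.seq <- ave(as.integer(df$event.of.interest), df$subject,
                    FUN = cumsum) * df$event.of.interest

print(df)
```

Because ave() groups by value rather than by consecutive runs, the two counters do not depend on rle(), although the row order still determines which row gets which number.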
> 
> HTH,
> 
> Marc Schwartz

Wow, that's very trick, or tricky. It works but it is a bit slower and
more complex than the Holtzman/Nielsen approach. But some interesting
ideas there which I shall bear in mind.

Tim C