[R] How to speed up or avoid the for-loops in this example?
Marc Schwartz
marc_schwartz at comcast.net
Thu Feb 15 03:48:50 CET 2007
On Thu, 2007-02-15 at 12:24 +1100, Tim Churches wrote:
> Any advice, tips, clues or pointers to resources on how best to speed up
> or, better still, avoid the loops in the following example code much
> appreciated. My actual dataset has several tens of thousands of rows and
> lots of columns, and these loops take a rather long time to run.
> Everything else which I need to do is done using vectors and those parts
> all run very quickly indeed. I spent quite a while doing searches on
> r-help and re-reading the various manuals, but couldn't find any
> existing relevant advice. I am sure the solution is obvious, but it
> escapes me.
>
> Tim C
>
> # create an example data frame, multiple events per subject
>
> year <- c(1980,1982,1996,1985,1987,1990,1991,1992,1999,1972,1983)
> event.of.interest <- c(F,T,T,F,F,F,T,F,T,T,F)
> subject <- c(1,1,1,2,2,3,3,3,3,4,4)
> df <- data.frame(cbind(subject,year,event.of.interest))
>
> # add a per-subject sequence number
>
> df$subject.seq <- 1
> for (i in 2:nrow(df)) {
> if (df$subject[i-1] == df$subject[i]) df$subject.seq[i] <-
> df$subject.seq[i-1] + 1
> }
> df
>
> # add an event sequence number which is zero until the first
> # event of interest for that subject happens, and then increments
> # thereafter
>
> df$event.seq <- 0
> for (i in 1:nrow(df)) {
> if (df$subject.seq[i] == 1 ) {
> current.event.seq <- 0
> }
> if (event.of.interest[i] == 1 | current.event.seq > 0)
> current.event.seq <- current.event.seq + 1
> df$event.seq[i] <- current.event.seq
> }
> df
OK, here is one possible solution, though with a bit more time a more
efficient approach could probably be found.
Before using your example data above, note that you do not want to
use:
df <- data.frame(cbind(subject,year,event.of.interest))
Calling cbind() first creates a matrix, which coerces all columns to a
common data type and so defeats the main benefit of a data frame, namely
that each column can have its own data type. For example:
> str(df)
'data.frame': 11 obs. of 3 variables:
$ subject : num 1 1 1 2 2 3 3 3 3 4 ...
$ year : num 1980 1982 1996 1985 1987 ...
$ event.of.interest: num 0 1 1 0 0 0 1 0 1 1 ...
Note that your column "event.of.interest" is coerced to numeric,
rather than staying logical.
Thus, use:
df <- data.frame(subject, year, event.of.interest)
> str(df)
'data.frame': 11 obs. of 3 variables:
$ subject : num 1 1 1 2 2 3 3 3 3 4 ...
$ year : num 1980 1982 1996 1985 1987 ...
$ event.of.interest: logi FALSE TRUE TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE
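The coercion is easier to see when one of the columns is character. The following is only a minimal sketch with made-up vectors (`id`, `label` and `flag` are illustrative names, not part of the original example):

```r
# Hypothetical example vectors, not from the original post
id    <- c(1, 2, 3)
label <- c("a", "b", "c")
flag  <- c(TRUE, FALSE, TRUE)

# cbind() builds a matrix, so every value is coerced to a single type
m <- cbind(id, label, flag)
typeof(m)          # "character"

# data.frame() called directly keeps a separate type for each column
good <- data.frame(id, label, flag)
sapply(good, class)
```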
So, now on to the solution:
# First, order the data frame by increasing order of
# subject number and decreasing order for event.of.interest
# This ensures that these columns are properly sorted
# to facilitate the subsequent code.
df <- df[order(df$subject, -df$event.of.interest), ]
So, 'df' will look like:
> df
subject year event.of.interest
2 1 1982 TRUE
3 1 1996 TRUE
1 1 1980 FALSE
4 2 1985 FALSE
5 2 1987 FALSE
7 3 1991 TRUE
9 3 1999 TRUE
6 3 1990 FALSE
8 3 1992 FALSE
10 4 1972 TRUE
11 4 1983 FALSE
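The ordering step can be checked on its own. The sketch below rebuilds the two example vectors and shows why negating a logical puts TRUE first; sorting on `!event.of.interest` is shown only as an equivalent spelling for comparison, and `o1` and `o2` are illustrative names:

```r
subject <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4)
event.of.interest <- c(FALSE, TRUE, TRUE, FALSE, FALSE, FALSE,
                       TRUE, FALSE, TRUE, TRUE, FALSE)

# -TRUE is -1 and -FALSE is 0, so events sort first within each subject
o1 <- order(subject, -event.of.interest)

# Equivalent: !TRUE is FALSE, and FALSE sorts before TRUE
o2 <- order(subject, !event.of.interest)

identical(o1, o2)
o1
# [1]  2  3  1  4  5  7  9  6  8 10 11
```

Note that this sort moves each subject's events ahead of the non-events, so the rows are no longer in year order within a subject.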
# Now use a combination of sapply(), rle(), seq() and unlist() to
# generate per subject sequences. Note that rle() returns:
#
# > rle(df$subject)
# Run Length Encoding
# lengths: int [1:4] 3 2 4 2
# values : num [1:4] 1 2 3 4
#
# See ?rle, ?seq, ?sapply and ?unlist
df$subject.seq <- unlist(sapply(rle(df$subject)$lengths,
function(x) seq(x)))
So, 'df' now looks like:
> df
subject year event.of.interest subject.seq
2 1 1982 TRUE 1
3 1 1996 TRUE 2
1 1 1980 FALSE 3
4 2 1985 FALSE 1
5 2 1987 FALSE 2
7 3 1991 TRUE 1
9 3 1999 TRUE 2
6 3 1990 FALSE 3
8 3 1992 FALSE 4
10 4 1972 TRUE 1
11 4 1983 FALSE 2
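As a possible shortcut for the same per-subject counter, base R's sequence() builds the concatenated 1..n runs directly from the run lengths. This is a sketch of an alternative, not the approach used above, and it assumes (as above) that rows for the same subject are adjacent:

```r
subject <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4)

# Run lengths of consecutive identical subject ids: 3 2 4 2
runs <- rle(subject)$lengths

# sequence() concatenates seq_len(n) for each n in 'runs'
subject.seq <- sequence(runs)
subject.seq
# [1] 1 2 3 1 2 1 2 3 4 1 2
```

Because it never goes through sapply(), this form also behaves the same when every subject happens to have the same number of rows.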
# Now set event.seq to all 0's
df$event.seq <- 0
So, 'df' now looks like:
> df
subject year event.of.interest subject.seq event.seq
2 1 1982 TRUE 1 0
3 1 1996 TRUE 2 0
1 1 1980 FALSE 3 0
4 2 1985 FALSE 1 0
5 2 1987 FALSE 2 0
7 3 1991 TRUE 1 0
9 3 1999 TRUE 2 0
6 3 1990 FALSE 3 0
8 3 1992 FALSE 4 0
10 4 1972 TRUE 1 0
11 4 1983 FALSE 2 0
# Get the unique subject IDs
# See ?unique
subj.id <- unique(df$subject)
# Now get the indices for each subject where event.of.interest
# is TRUE. See ?which
events <- sapply(subj.id,
function(x) which(df$subject == x & df$event.of.interest))
So, 'events' looks like:
> events
[[1]]
[1] 1 2
[[2]]
integer(0)
[[3]]
[1] 6 7
[[4]]
[1] 10
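Whether 'events' comes back as a list, as printed above, depends on sapply() being unable to simplify results of different lengths. The sketch below uses small made-up inputs (`ragged`, `even` and `always` are illustrative names) to show the difference, and why lapply() may be the safer choice when a list is always wanted:

```r
# Results of different lengths: sapply() cannot simplify, so it returns a list
ragged <- sapply(list(1:2, integer(0), 6:7), function(x) seq(along = x))
is.list(ragged)      # TRUE

# Results of equal length: sapply() simplifies them into a matrix
even <- sapply(list(1:2, 6:7), function(x) seq(along = x))
is.matrix(even)      # TRUE

# lapply() returns a list in both cases
always <- lapply(list(1:2, 6:7), function(x) seq(along = x))
is.list(always)      # TRUE
```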
# Now use sapply() on the above list to create
# individual sequences per list element:
seq <- sapply(events, function(x) seq(along = x))
So 'seq' looks like:
> seq
[[1]]
[1] 1 2
[[2]]
integer(0)
[[3]]
[1] 1 2
[[4]]
[1] 1
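The empty second element is the reason for using seq(along = x) rather than 1:length(x). A short sketch of the difference on an empty vector (`empty` is an illustrative name):

```r
empty <- integer(0)

seq(along = empty)    # integer(0): no indices at all
seq_along(empty)      # integer(0): same result, newer spelling
1:length(empty)       # 1 0: counts down from 1 to 0

length(1:length(empty))
# [1] 2
```

With 1:length(x), a subject with no events would contribute two spurious values to the final assignment.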
# So, for the final step, assign the event sequence values in 'seq' to
# the row indices in 'events':
df$event.seq[unlist(events)] <- unlist(seq)
So, 'df' now looks like this:
> df
subject year event.of.interest subject.seq event.seq
2 1 1982 TRUE 1 1
3 1 1996 TRUE 2 2
1 1 1980 FALSE 3 0
4 2 1985 FALSE 1 0
5 2 1987 FALSE 2 0
7 3 1991 TRUE 1 1
9 3 1999 TRUE 2 2
6 3 1990 FALSE 3 0
8 3 1992 FALSE 4 0
10 4 1972 TRUE 1 1
11 4 1983 FALSE 2 0
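If the goal is to reproduce the loops in the original question exactly, where event.seq keeps incrementing on every row after a subject's first event, one possible vectorised sketch uses ave() with cumsum(). This is an alternative to the steps above rather than a restatement of them. It assumes the rows are in chronological order within each subject, as in the original example, and the names `df2` and `started` are introduced only for illustration:

```r
subject <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4)
year <- c(1980, 1982, 1996, 1985, 1987, 1990, 1991, 1992, 1999, 1972, 1983)
event.of.interest <- c(FALSE, TRUE, TRUE, FALSE, FALSE, FALSE,
                       TRUE, FALSE, TRUE, TRUE, FALSE)
df2 <- data.frame(subject, year, event.of.interest)

# Per-subject row counter
df2$subject.seq <- ave(seq_along(df2$subject), df2$subject, FUN = seq_along)

# TRUE from each subject's first event onwards
started <- ave(as.integer(df2$event.of.interest), df2$subject,
               FUN = cumsum) > 0

# Count the rows from the first event onwards, per subject
df2$event.seq <- ave(as.integer(started), df2$subject, FUN = cumsum)

df2$event.seq
# [1] 0 1 2 0 0 0 1 2 3 1 2
```

On this data three rows differ from the table above (subject 3's 1992 and 1999 rows and subject 4's 1983 row), because the loop in the question keeps counting on every row after a subject's first event.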
HTH,
Marc Schwartz