[R] Avoiding for-loop for splitting vector into subvectors based on positions
William Dunlap
wdunlap at tibco.com
Wed May 5 21:58:53 CEST 2010
>
> Thanks, works nicely. I have to do some clocking to see how much the
> improvement is, but I surely learnt again.
>
> Attentive readers might have noticed my initial code contains
> an error.
> tmp <- x[pos2[i]:pos2[i+1]]
> should be:
> tmp <- x[pos2[i]:(pos2[i+1]-1)]
> of course...
I think you also wanted your for loop to run
along 1:length(pos) instead of 1:length(x).
Your subject line asked how to avoid a for loop
but you seem to be interested in how to make
your function run quickly. These are different
questions.
The following test functions seem to show that
your time (and probably memory) problems arise
from growing a dataset:
   out <- c()
   for(i in 1:length(pos)) {
      ...
      out <- c(out, length(tmp))
   }
instead of preallocating it and inserting into it:
   out <- numeric(length(pos)) # or integer or list or ... ?
   for(i in 1:length(pos)) {
      ...
      out[i] <- length(tmp)
   }
makeData <- function (nX, nPos) {
    # make data for timing tests
    pos <- sort(sample(nX, size=nPos, replace=FALSE))
    pos[1] <- 1L
    list(x = seq_len(nX), pos = pos)
}
f0 <- function (x, pos, FUN = length) {
    # OP's code, slightly modified
    pos2 <- c(pos, length(x) + 1)
    retval <- c()
    for (i in seq_len(length(pos))) {
        tmp <- x[pos2[i]:(pos2[i + 1] - 1)]
        retval <- c(retval, FUN(tmp))
    }
    retval
}
f1 <- function (x, pos, FUN = length) {
    # like f0 but we preallocate the result
    pos2 <- c(pos, length(x) + 1)
    retval <- numeric(length(pos))
    for (i in seq_len(length(pos))) {
        tmp <- x[pos2[i]:(pos2[i + 1] - 1)]
        retval[i] <- FUN(tmp)
    }
    retval
}
f2 <- function (x, pos, FUN = length) {
    # use tapply
    groupId <- rep(seq_along(pos), diff(c(pos, length(x) + 1)))
    tapply(x, groupId, FUN)
}
f3 <- function (x, pos, FUN = length) {
    # lapply(split(...))
    groupId <- rep(seq_along(pos), diff(c(pos, length(x) + 1)))
    unlist(lapply(split(x, groupId), FUN))
}
# make one million numbers in 400 thousand groups
z <- makeData(nX=1e6, nPos=4e5)
t0 <- system.time( r0 <- f0(z$x, z$pos) )
t1 <- system.time( r1 <- f1(z$x, z$pos) )
t2 <- system.time( r2 <- f2(z$x, z$pos) )
t3 <- system.time( r3 <- f3(z$x, z$pos) )
> rbind(t0=t0, t1=t1, t2=t2, t3=t3)
   user.self sys.self elapsed user.child sys.child
t0    429.44     3.30  425.84         NA        NA
t1      3.20     0.00    3.16         NA        NA
t2      6.91     0.01    6.72         NA        NA
t3      2.68     0.02    2.72         NA        NA
The results from each, r0-r3, are almost the same.
f1 produced a "numeric" (double precision) result
instead of an integer one (length() returns an integer).
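
To see where the results differ, here is a small sketch on toy data. The vectors x and pos and the r_* names are made up for illustration and are not the timing data above. It repeats the preallocated loop, the tapply() call and the split()/lapply() call, then compares storage type and attributes.

```r
# Toy data (hypothetical): 10 values, groups starting at positions 1, 4 and 8
x <- 1:10
pos <- c(1L, 4L, 8L)
pos2 <- c(pos, length(x) + 1L)
groupId <- rep(seq_along(pos), diff(pos2))

# Preallocated loop: numeric() creates a double vector, so results are stored as doubles
r_loop <- numeric(length(pos))
for (i in seq_along(pos)) {
    r_loop[i] <- length(x[pos2[i]:(pos2[i + 1] - 1)])
}

# tapply() returns a 1-d integer array with dimnames
r_tapply <- tapply(x, groupId, length)

# split()/lapply() returns a named integer vector
r_split <- unlist(lapply(split(x, groupId), length))

typeof(r_loop)    # "double"
typeof(r_tapply)  # "integer"
typeof(r_split)   # "integer"

# identical() is strict about type and attributes; strip both before comparing values
identical(r_loop, r_split)                               # FALSE
identical(as.integer(r_loop), as.vector(r_tapply))       # TRUE
identical(as.vector(r_tapply), unname(r_split))          # TRUE
```

So the values agree, and only the storage type, names and dim attributes differ.
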
tapply() spends time seeing if FUN always returns
the same kind of result and simplifies the answer
if it does. The others will run into problems
if FUN doesn't always return a single number. Choose
a method based on how general the code needs to be
and how much error checking you require.
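
If you want the error checking without tapply()'s simplification step, one option is vapply(), which checks each result against a template. The following is only a sketch: f_vapply and its FUN.VALUE argument are hypothetical names, not part of the timing tests above, and it has not been timed.

```r
# Hypothetical helper: same grouping as the split()/lapply() version,
# but vapply() stops with an error unless every FUN result matches FUN.VALUE.
f_vapply <- function (x, pos, FUN = length, FUN.VALUE = numeric(1)) {
    groupId <- rep(seq_along(pos), diff(c(pos, length(x) + 1)))
    vapply(split(x, groupId), FUN, FUN.VALUE = FUN.VALUE)
}

x <- 1:10
pos <- c(1, 4, 8)

sizes <- f_vapply(x, pos)             # group lengths: 3, 4, 3
sums  <- f_vapply(x, pos, FUN = sum)  # group sums: 6, 22, 27

# range() returns two numbers per group, so the length-1 template rejects it
bad <- tryCatch(f_vapply(x, pos, FUN = range), error = function(e) e)
inherits(bad, "error")                # TRUE
```

With the numeric(1) template, integer results are promoted to double, so the result is always a double vector. Pass FUN.VALUE = integer(1) if an integer result is required.
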
In any case, growing a vector that is destined to be
large can take a lot of time.
