[R] Avoiding for-loop for splitting vector into subvectorsbasedon positions

Wed May 5 23:33:05 CEST 2010

> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of William Dunlap
> Sent: Wednesday, May 05, 2010 12:59 PM
> To: Joris Meys; jim holtman
> Cc: R mailing list
> Subject: Re: [R] Avoiding for-loop for splitting vector into 
> subvectorsbasedon positions
> 
> > -----Original Message-----
> > From: r-help-bounces at r-project.org 
> > [mailto:r-help-bounces at r-project.org] On Behalf Of Joris Meys
> > Sent: Tuesday, May 04, 2010 2:02 PM
> > To: jim holtman
> > Cc: R mailing list
> > Subject: Re: [R] Avoiding for-loop for splitting vector into 
> > subvectorsbased on positions
> > 
> > Thanks, works nicely. I have to do some clocking to see how much the
> > improvement is, but I surely learnt again.
> > 
> > Attentive readers might have noticed my initial code contains 
> > an error.
> > tmp <- x[pos2[i]:pos2[i+1]]
> > should be:
> > tmp <- x[pos2[i]:(pos2[i+1]-1)]
> > off course...
> 
> I think you also wanted your for loop to run
> along 1:length(pos) instead of 1:length(x).
> 
> Your subject line asked how to avoid a for loop
> but you seem to be interested in how to make
> your function run quickly.  These are different
> questions.
> 
> The following test functions seem to show that
> your time (and probably memory) problems arise
> from growing a dataset:
>    out <- c()
>    for(i in 1:length(pos)) {
>         ...
>         out<-c(out, length(tmp))
>    }
> instead of preallocating it and inserting into it:
>    out <- numeric(length(pos)) # or integer or list or ... ?
>    for(i in 1:length(pos)) {
>         ...
>         out[i] <- length(tmp)
>    }
> 
> makeData <- function (nX, nPos) {
>     # make data for timing tests
>     pos <- sort(sample(nX, size=nPos, replace=FALSE))
>     pos[1] <- 1L
>     list(x = seq_len(nX), pos = pos)
> }
> 
> f0 <- function (x, pos, FUN = length) {
>     # OP's code, slightly modified
>     pos2 <- c(pos, length(x) + 1)
>     retval <- c()
>     for (i in seq_len(length(pos))) {
>         tmp <- x[pos2[i]:(pos2[i + 1] - 1)]
>         retval <- c(retval, FUN(tmp))
>     }
>     retval
> }
> 
> f1 <- function (x, pos, FUN = length) {
>     # like f0 but we preallocate the result
>     pos2 <- c(pos, length(x) + 1)
>     retval <- numeric(length(pos))
>     for (i in seq_len(length(pos))) {
>         tmp <- x[pos2[i]:(pos2[i + 1] - 1)]
>         retval[i] <- FUN(tmp)
>     }
>     retval
> }
> 
> f2 <- function (x, pos, FUN = length) {
>     # use tapply
>     groupId <- rep(seq_along(pos), diff(c(pos, length(x) + 1)))
>     tapply(x, groupId, FUN)
> }
> 
> f3 <- function (x, pos, FUN = length) {
>     # lapply(split(...))
>     groupId <- rep(seq_along(pos), diff(c(pos, length(x) + 1)))
>     unlist(lapply(split(x, groupId), FUN))
> }
> 
> # make one million numbers in 400 thousand groups 
> z <- makeData(nX=1e6, nPos=4e5)
> t0 <- system.time( r0 <- f0(z$x, z$pos) )
> t1 <- system.time( r1 <- f1(z$x, z$pos) )
> t2 <- system.time( r2 <- f2(z$x, z$pos) )
> t3 <- system.time( r3 <- f3(z$x, z$pos) )
> 
> > rbind(t0=t0, t1=t1, t2=t2, t3=t3)
>    user.self sys.self elapsed user.child sys.child
> t0    429.44     3.30  425.84         NA        NA
> t1      3.20     0.00    3.16         NA        NA
> t2      6.91     0.01    6.72         NA        NA
> t3      2.68     0.02    2.72         NA        NA

I forgot to mention the new-to-R-2.11.0 vapply()
function.  If you know the type of the output of
FUN and the type is simple enough it can do what
tapply() or [ls]apply(split()) do but more reliably
and using less time and memory.

f4 <- function (x, pos, FUN = length) {
    groupId <- rep(seq_along(pos), diff(c(pos, length(x) + 1)))
    vapply(split(x, groupId), FUN = FUN, FUN.VALUE = numeric(1))
}

> system.time(r4 <- f4(z$x, z$pos))
   user  system elapsed 
   2.23    0.01    2.31 
> all(r4==r0)
[1] TRUE

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com  

> 
> The results from each, r0-r3, are almost the same.
> f1 produced a "numeric" (double precision) result
> instead of an integer one (length() returns an integer).
> tapply() spends time seeing if FUN always returns
> the same kind of result and simplifies the answer
> if it does.  The others will run into problems
> if FUN doesn't always return a single number.  Choose
> a method based on how general the code needs to be
> and how much error checking your require.
> 
> In any case, growing a vector that is destined to be
> large can take a lot of time.
> 
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com 
> 
> > 
> > On Tue, May 4, 2010 at 5:50 PM, jim holtman 
> > <jholtman at gmail.com> wrote:
> > 
> > > Try this:
> > >
> > > > x <- 1:10
> > > > pos <- c(1,4,7)
> > > > pat <- rep(seq_along(pos), times=diff(c(pos, length(x) + 1)))
> > > > split(x, pat)
> > > $`1`
> > > [1] 1 2 3
> > > $`2`
> > > [1] 4 5 6
> > > $`3`
> > > [1]  7  8  9 10
> > >
> > >
> > >
> > > On Tue, May 4, 2010 at 11:29 AM, Joris Meys 
> > <jorismeys at gmail.com> wrote:
> > >
> > >> Dear all,
> > >>
> > >> I'm trying to optimize code and want to avoid for-loops 
> as much as
> > >> possible.
> > >> I'm applying a calculation on subvectors from a big one, 
> > and I get the
> > >> subvectors by using a vector of starting positions:
> > >>
> > >> x <- 1:10
> > >> pos <- c(1,4,7)
> > >> n <- length(x)
> > >>
> > >> I try to do something like this :
> > >> pos2 <- c(pos, n+1)
> > >>
> > >> out <- c()
> > >> for(i in 1:n){
> > >>     tmp <- x[pos2[i]:pos2[i+1]]
> > >>     out <- c(out, length(tmp))
> > >> }
> > >>
> > >> Never mind the length function, I apply a far more 
> > complicated one. It's
> > >> about the use of the indices in the for-loop. I didn't see 
> > any way of
> > >> doing
> > >> that with an apply, unless there is a very convenient way 
> > of splitting my
> > >> vector in a list of the subvectors or so.
> > >>
> > >> Anybody an idea?
> > >> Cheers
> > >> --
> > >> Joris Meys
> > >> Statistical Consultant
> > >>
> > >> Ghent University
> > >> Faculty of Bioscience Engineering
> > >> Department of Applied mathematics, biometrics and process control
> > >>
> > >> Coupure Links 653
> > >> B-9000 Gent
> > >>
> > >> tel : +32 9 264 59 87
> > >> Joris.Meys at Ugent.be
> > >> -------------------------------
> > >> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
> > >>
> > >>        [[alternative HTML version deleted]]
> > >>
> > >> ______________________________________________
> > >> R-help at r-project.org mailing list
> > >> https://stat.ethz.ch/mailman/listinfo/r-help
> > >> PLEASE do read the posting guide
> > >> 
> > http://www.R-project.org/posting-guide.html<http://www.r-proje
> ct.org/posting-guide.html>
> > >> and provide commented, minimal, self-contained, 
> reproducible code.
> > >>
> > >
> > >
> > >
> > > --
> > > Jim Holtman
> > > Cincinnati, OH
> > > +1 513 646 9390
> > >
> > > What is the problem that you are trying to solve?
> > >
> > 
> > 
> > 
> > -- 
> > Joris Meys
> > Statistical Consultant
> > 
> > Ghent University
> > Faculty of Bioscience Engineering
> > Department of Applied mathematics, biometrics and process control
> > 
> > Coupure Links 653
> > B-9000 Gent
> > 
> > tel : +32 9 264 59 87
> > Joris.Meys at Ugent.be
> > -------------------------------
> > Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
> > 
> > 	[[alternative HTML version deleted]]
> > 
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide 
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> > 
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>