[R] Avoiding for-loop for splitting vector into subvectorsbasedon positions
William Dunlap
wdunlap at tibco.com
Wed May 5 23:33:05 CEST 2010
> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of William Dunlap
> Sent: Wednesday, May 05, 2010 12:59 PM
> To: Joris Meys; jim holtman
> Cc: R mailing list
> Subject: Re: [R] Avoiding for-loop for splitting vector into
> subvectorsbasedon positions
>
> > -----Original Message-----
> > From: r-help-bounces at r-project.org
> > [mailto:r-help-bounces at r-project.org] On Behalf Of Joris Meys
> > Sent: Tuesday, May 04, 2010 2:02 PM
> > To: jim holtman
> > Cc: R mailing list
> > Subject: Re: [R] Avoiding for-loop for splitting vector into
> > subvectorsbased on positions
> >
> > Thanks, works nicely. I have to do some clocking to see how much the
> > improvement is, but I surely learnt again.
> >
> > Attentive readers might have noticed my initial code contains
> > an error.
> > tmp <- x[pos2[i]:pos2[i+1]]
> > should be:
> > tmp <- x[pos2[i]:(pos2[i+1]-1)]
> > off course...
>
> I think you also wanted your for loop to run
> along 1:length(pos) instead of 1:length(x).
>
> Your subject line asked how to avoid a for loop
> but you seem to be interested in how to make
> your function run quickly. These are different
> questions.
>
> The following test functions seem to show that
> your time (and probably memory) problems arise
> from growing a dataset:
> out <- c()
> for(i in 1:length(pos)) {
> ...
> out<-c(out, length(tmp))
> }
> instead of preallocating it and inserting into it:
> out <- numeric(length(pos)) # or integer or list or ... ?
> for(i in 1:length(pos)) {
> ...
> out[i] <- length(tmp)
> }
>
> makeData <- function (nX, nPos) {
> # make data for timing tests
> pos <- sort(sample(nX, size=nPos, replace=FALSE))
> pos[1] <- 1L
> list(x = seq_len(nX), pos = pos)
> }
>
> f0 <- function (x, pos, FUN = length) {
> # OP's code, slightly modified
> pos2 <- c(pos, length(x) + 1)
> retval <- c()
> for (i in seq_len(length(pos))) {
> tmp <- x[pos2[i]:(pos2[i + 1] - 1)]
> retval <- c(retval, FUN(tmp))
> }
> retval
> }
>
> f1 <- function (x, pos, FUN = length) {
> # like f0 but we preallocate the result
> pos2 <- c(pos, length(x) + 1)
> retval <- numeric(length(pos))
> for (i in seq_len(length(pos))) {
> tmp <- x[pos2[i]:(pos2[i + 1] - 1)]
> retval[i] <- FUN(tmp)
> }
> retval
> }
>
> f2 <- function (x, pos, FUN = length) {
> # use tapply
> groupId <- rep(seq_along(pos), diff(c(pos, length(x) + 1)))
> tapply(x, groupId, FUN)
> }
>
> f3 <- function (x, pos, FUN = length) {
> # lapply(split(...))
> groupId <- rep(seq_along(pos), diff(c(pos, length(x) + 1)))
> unlist(lapply(split(x, groupId), FUN))
> }
>
> # make one million numbers in 400 thousand groups
> z <- makeData(nX=1e6, nPos=4e5)
> t0 <- system.time( r0 <- f0(z$x, z$pos) )
> t1 <- system.time( r1 <- f1(z$x, z$pos) )
> t2 <- system.time( r2 <- f2(z$x, z$pos) )
> t3 <- system.time( r3 <- f3(z$x, z$pos) )
>
> > rbind(t0=t0, t1=t1, t2=t2, t3=t3)
> user.self sys.self elapsed user.child sys.child
> t0 429.44 3.30 425.84 NA NA
> t1 3.20 0.00 3.16 NA NA
> t2 6.91 0.01 6.72 NA NA
> t3 2.68 0.02 2.72 NA NA
I forgot to mention the new-to-R-2.11.0 vapply()
function. If you know the type of the output of
FUN and the type is simple enough it can do what
tapply() or [ls]apply(split()) do but more reliably
and using less time and memory.
f4 <- function (x, pos, FUN = length) {
groupId <- rep(seq_along(pos), diff(c(pos, length(x) + 1)))
vapply(split(x, groupId), FUN = FUN, FUN.VALUE = numeric(1))
}
> system.time(r4 <- f4(z$x, z$pos))
user system elapsed
2.23 0.01 2.31
> all(r4==r0)
[1] TRUE
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
>
> The results from each, r0-r3, are almost the same.
> f1 produced a "numeric" (double precision) result
> instead of an integer one (length() returns an integer).
> tapply() spends time seeing if FUN always returns
> the same kind of result and simplifies the answer
> if it does. The others will run into problems
> if FUN doesn't always return a single number. Choose
> a method based on how general the code needs to be
> and how much error checking your require.
>
> In any case, growing a vector that is destined to be
> large can take a lot of time.
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
> >
> > On Tue, May 4, 2010 at 5:50 PM, jim holtman
> > <jholtman at gmail.com> wrote:
> >
> > > Try this:
> > >
> > > > x <- 1:10
> > > > pos <- c(1,4,7)
> > > > pat <- rep(seq_along(pos), times=diff(c(pos, length(x) + 1)))
> > > > split(x, pat)
> > > $`1`
> > > [1] 1 2 3
> > > $`2`
> > > [1] 4 5 6
> > > $`3`
> > > [1] 7 8 9 10
> > >
> > >
> > >
> > > On Tue, May 4, 2010 at 11:29 AM, Joris Meys
> > <jorismeys at gmail.com> wrote:
> > >
> > >> Dear all,
> > >>
> > >> I'm trying to optimize code and want to avoid for-loops
> as much as
> > >> possible.
> > >> I'm applying a calculation on subvectors from a big one,
> > and I get the
> > >> subvectors by using a vector of starting positions:
> > >>
> > >> x <- 1:10
> > >> pos <- c(1,4,7)
> > >> n <- length(x)
> > >>
> > >> I try to do something like this :
> > >> pos2 <- c(pos, n+1)
> > >>
> > >> out <- c()
> > >> for(i in 1:n){
> > >> tmp <- x[pos2[i]:pos2[i+1]]
> > >> out <- c(out, length(tmp))
> > >> }
> > >>
> > >> Never mind the length function, I apply a far more
> > complicated one. It's
> > >> about the use of the indices in the for-loop. I didn't see
> > any way of
> > >> doing
> > >> that with an apply, unless there is a very convenient way
> > of splitting my
> > >> vector in a list of the subvectors or so.
> > >>
> > >> Anybody an idea?
> > >> Cheers
> > >> --
> > >> Joris Meys
> > >> Statistical Consultant
> > >>
> > >> Ghent University
> > >> Faculty of Bioscience Engineering
> > >> Department of Applied mathematics, biometrics and process control
> > >>
> > >> Coupure Links 653
> > >> B-9000 Gent
> > >>
> > >> tel : +32 9 264 59 87
> > >> Joris.Meys at Ugent.be
> > >> -------------------------------
> > >> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
> > >>
> > >> [[alternative HTML version deleted]]
> > >>
> > >> ______________________________________________
> > >> R-help at r-project.org mailing list
> > >> https://stat.ethz.ch/mailman/listinfo/r-help
> > >> PLEASE do read the posting guide
> > >>
> > http://www.R-project.org/posting-guide.html<http://www.r-proje
> ct.org/posting-guide.html>
> > >> and provide commented, minimal, self-contained,
> reproducible code.
> > >>
> > >
> > >
> > >
> > > --
> > > Jim Holtman
> > > Cincinnati, OH
> > > +1 513 646 9390
> > >
> > > What is the problem that you are trying to solve?
> > >
> >
> >
> >
> > --
> > Joris Meys
> > Statistical Consultant
> >
> > Ghent University
> > Faculty of Bioscience Engineering
> > Department of Applied mathematics, biometrics and process control
> >
> > Coupure Links 653
> > B-9000 Gent
> >
> > tel : +32 9 264 59 87
> > Joris.Meys at Ugent.be
> > -------------------------------
> > Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
> >
> > [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
More information about the R-help
mailing list