[R] Avoiding loops

William Dunlap wdunlap at tibco.com
Wed Sep 2 18:50:54 CEST 2009


Martin,
   Thanks for showing the timing tests.  It is important
to see how the time (and memory usage) grows with
the size of the problem, where size may be the number
of rows or length of the lag.
   Here is another function to toss in the hat.  It uses no
loops and does all the sum by diff'ing a cumsum, which
loses some precision.  I think the big speedup comes from
the calculation of startPos.  You can also use
approx(method="const") or some zoo function to do this.

The gSum function, which  computes sums of overlapping
subsequences of its input, could be changed to a call to lapply
without a dramatic lose of speed and thus avoid the precision
problems.

It also supposes that dat$a is already sorted. 

f.wwd <- function(dat, max=5) {
   # filtering approach
   a <- dat$a
   minStart <- a - max
   i <- rep(c(FALSE, TRUE), each=length(a))[order(c(minStart, a))]
   startPos <- cumsum(i)[!i] + 1
   endPos <- seq(along=a)
   gSum <- function(x) {
      cs <- cumsum(x)
      cs[endPos] - cs[startPos] + x[startPos]
   }
   dat$b <- gSum(dat$b)
   dat$c <- gSum(dat$c)
   dat
}

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com 

> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of Martin Morgan
> Sent: Wednesday, September 02, 2009 9:17 AM
> To: Alexander Shenkin
> Cc: r-help at r-project.org; spector at stat.berkeley.edu; 
> cberry at tajo.ucsd.edu
> Subject: Re: [R] Avoiding loops
> 
> Alexander Shenkin wrote:
> > Though, from my limited understanding, the 'apply' family 
> of functions
> > are actually just loops.  Please correct me if I'm wrong.  So, while
> > more readable (which is important), they're not necessarily more
> > efficient than explicit 'for' loops.
> 
> Hi Allie -- This uses an R-level loop (and a lot of C loops!), but the
> length of the loop is only as long as the maximum lag
> 
> 
> f0 <- function(df0, max_lag)
> {
>     max_lag <- min(nrow(df0), max_lag)
>     a <- df0[[1]]
>     ans <- df <- df0[,-1, drop=FALSE]
>     for (lag in seq_len(max_lag)) {
>         idx <- diff(a, lag) <= max_lag
>         pad <- logical(lag)
>         ans[c(pad, idx),] <- ans[c(pad, idx),] + df[c(idx, pad),]
>     }
>     cbind(a, ans)
> }
> 
> it makes the assumption that 'a' is sorted and unique, as in a time
> series. This
> 
> f1 <- function(df0, max_lag)
> {
>     max_lag <- min(nrow(df0), max_lag)
>     a <- df0[[1]]
>     ans <- df0[,-1, drop=FALSE]
>     lag <- 1
>     while(sum(idx <- diff(a, lag) <= max_lag) != 0) {
>         pad <- logical(lag)
>         ans[c(pad, idx),] <- ans[c(pad, idx),] + df[c(idx, pad),]
>         lag <- lag + 1
>     }
>     cbind(a, ans)
> }
> 
> relaxes the assumption that 'a' is unique, I think, but I 
> haven't tested
> carefully; it seems to perform about the same as f0. I think there's a
> clever recursive solution in there, too.
> 
> This is my implementation of Phil's solution
> 
> phil0 <- function(df0, max_lag)
> {
>     with(df0, {
>         g <- function(x)
>             apply(df0[a - x >= -max_lag & a - x <= 0, c('b','c')],
>                   2, sum)
>         data.frame(a, t(sapply(a, g)))
>     })
> }
> 
> Here's my implementation of Chuck Berry's solution
> 
> chuck0 <- function(df0, max_lag)
> {
>     criterion <-
>         as.matrix(dist(df0$a)) <= max_lag & outer(df0$a,df0$a,">=")
>     criterion %*% as.matrix(df0[, c("b","c")])
> }
> 
> Here's a data generator
> 
> setup <- function(n, m)
>     ## n: number of rows
>     ## m: expected counts per sum
> {
>     a0 <- sort(sample(seq_len(m * n), n))
>     data.frame(a=a0, b=as.integer(runif(n, 1, 10)),
>                c=as.integer(runif(n, 1, 10)))
> }
> 
> and a comparison with
> 
> df0 <- setup(10^3, 3)
> max_lag <- 5
> 
> > system.time(f0res <- f0(df0, max_lag), gcFirst=TRUE)
>    user  system elapsed
>   0.016   0.000   0.016
> > system.time(phil0res <- phil0(df0, max_lag), gcFirst=TRUE)
>    user  system elapsed
>   0.960   0.000   0.962
> > system.time(chuck0res <- chuck0(df0, 5), gcFirst=TRUE)
>    user  system elapsed
>   0.252   0.000   0.254
> 
> > all.equal(f0res, phil0res)
> [1] TRUE
> 
> > all.equal(as.matrix(f0res[,2:3]), chuck0res, check.attributes=FALSE)
> [1] TRUE
> 
> The f0 solution seems to be usable up to about a million rows,
> 
> > df0 <- setup(10^6, 3)
> > system.time(f0res <- f0(df0, max_lag), gcFirst=TRUE)
>    user  system elapsed
>   2.680   0.004   2.700
> 
> Martin
> 
> > 
> > allie
> > 
> > On 9/2/2009 3:13 AM, Phil Spector wrote:
> >> Here's one way (assuming your data frame is named dat):
> >>
> >>    with(dat,
> >>         data.frame(a,t(sapply(a,function(x){
> >>                        apply(dat[a - x >= -5 & a - x <=
> >> 0,c('b','c')],2,sum)}))))
> >>
> >>
> >>                     - Phil Spector
> >>                      Statistical Computing Facility
> >>                      Department of Statistics
> >>                      UC Berkeley
> >>                      spector at stat.berkeley.edu
> >>
> >>
> >>
> >> On Tue, 1 Sep 2009, dolar wrote:
> >>
> >>> Would like some tips on how to avoid loops as I know they 
> are slow in R
> >>>
> >>> i've got a data frame :
> >>>
> >>> a  b  c
> >>> 1  5  2
> >>> 4  6  9
> >>> 5  2  3
> >>> 8  3  2
> >>>
> >>> What i'd like is to sum for each value of a, the sum of b 
> and the sum
> >>> of c
> >>> where a equal to or less than (with a distance of 5)
> >>>
> >>> i.e. for row three
> >>> we have a=5
> >>> i'd like to sum up b and sum up c with the above rule
> >>> since 5, 4 and 1 are less than (within a distance of 5) 
> or equal to
> >>> 5, then
> >>> we should get the following result:
> >>>
> >>> a  b   c
> >>> 5  13  14
> >>>
> >>> the overall result should be
> >>> a   b   c
> >>> 1   5   2
> >>> 4   11  11
> >>> 5   13  14
> >>> 8   11  14
> >>>
> >>> how can i do this without a loop?
> >>> -- 
> >>> View this message in context:
> >>> http://www.nabble.com/Avoiding-loops-tp25251376p25251376.html
> >>> Sent from the R help mailing list archive at Nabble.com.
> >>>
> >>> ______________________________________________
> >>> R-help at r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide
> >>> http://www.R-project.org/posting-guide.html
> >>> and provide commented, minimal, self-contained, reproducible code.
> >>>
> >> ______________________________________________
> >> R-help at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide
> >> http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> > 
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 




More information about the R-help mailing list