[R] need any advice for code optimization.
Rich FitzJohn
rich.fitzjohn at gmail.com
Mon Apr 4 11:29:01 CEST 2005
Hi,
One fruitful avenue for optimisation is to vectorise wherever possible
and avoid for-loops.
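To give a toy illustration of what I mean (nothing to do with your
filtering problem as such), squaring a vector in a loop and in a single
vectorised call give the same answer, but the loop pays R's
interpretation overhead on every element:

x <- runif(100000)
## Loop version: each element handled separately by interpreted code.
sq.loop <- numeric(length(x))
for ( i in seq(along=x) )
  sq.loop[i] <- x[i]^2
## Vectorised version: one call; the real work happens in compiled code.
sq.vec <- x^2
all.equal(sq.loop, sq.vec)  # TRUE, but the vectorised form is much faster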
Something like the code below might be a good place to start.
=============
## Generate a thousand rows of data
cube.half.size <- 2
mult.sigma <- 2
n <- 1000
whole <- data.frame(a=runif(n), b=runif(n), c=rnorm(n), d=runif(n))*10
cube.look <- function() {
  ## For row x, flag all rows falling inside the cube around it; if the
  ## cube is non-empty, return the indices of those rows whose value of
  ## c lies more than mult.sigma SDs from the cube's mean of c.
  f <- function(x) {
    with(whole, {
      i <- (abs(a - a[x]) < cube.half.size &
            abs(b - b[x]) < cube.half.size &
            abs(d - d[x]) < cube.half.size)
      if ( any(i) ) {
        subdat <- c[i]
        which(i)[abs(subdat - mean(subdat)) > sd(subdat)*mult.sigma]
      } else NULL
    })
  }
  ## Apply f to every row, then drop the union of flagged rows.
  td <- lapply(seq(length=n), f)
  whole[-unique(unlist(td)),]
}
## And wrap the original in a function for comparison:
cube.look.orig <- function() {
  to.drop <- data.frame()
  for(i in 1:length(whole$c)){
    pv <- subset(whole, abs(a-whole$a[i])<cube.half.size &
                        abs(b-whole$b[i])<cube.half.size &
                        abs(d-whole$d[i])<cube.half.size)
    if(length(pv$c)>1){
      mean.c <- mean(pv$c)
      sd.c <- sd(pv$c)
      td <- subset(pv, abs(c-mean.c)>sd.c*mult.sigma)
      if(length(td$c)>0){
        td.index <- which(row.names(td) %in% row.names(to.drop))
        to.drop <- rbind(to.drop, if(length(td.index)>0) td[-td.index,] else td)
        if(length(td.index)!=length(td$c))
          print(c("i=",i,"Points to drop: ",length(to.drop$c)))
      }
    }
  }
  td.orig <<- to.drop
  ## This does not subset the way you want:
  ## whole[-which(row.names(to.drop) %in% row.names(whole)),]
  whole[-as.integer(row.names(to.drop)),]
}
## Time how long it takes to run over the test data.frame (10 runs):
t.new <- t(replicate(10, system.time(cube.look())))
t.orig <- t(replicate(10, system.time(cube.look.orig())))
## On my alpha, the version using lapply() takes 4.9 seconds of CPU
## time (avg 10 runs), while the original version takes 23.3 seconds -
## so we're 4.8 times faster.
apply(t.orig, 2, mean)/apply(t.new, 2, mean)
## And the results are the same.
r.new <- cube.look()
r.orig <- cube.look.orig()
identical(r.new, r.orig)
==============
The above code could almost certainly be tweaked further (replacing
which() with a stripped-down version would probably save some time,
since profiling indicates about 10% of the run time is spent there).
Using with() saved another 10% or so, compared with indexing a..d
(e.g. whole$a) on every iteration. However, trying a completely
different approach is more likely to yield bigger savings. mapply()
might be one to try, though I have a feeling it is just a wrapper
around lapply(). The "Writing R Extensions" manual has a section that
deals with profiling ("Tidying and profiling R code"), and it may
touch on optimisation as well.
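(One such stripped-down replacement for which(), assuming the logical
vector i has no NAs, is seq(along=i)[i], which gives the same indices
without the names.) If you want to run the profiler yourself, here is
a minimal sketch using Rprof() and summaryRprof() (the output file
name is arbitrary; see ?Rprof):

Rprof("cube-look.out")         # start collecting profiling samples
invisible(cube.look())         # the call we want to profile
Rprof(NULL)                    # stop profiling
summaryRprof("cube-look.out")  # breakdown of time spent per function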
Cheers,
Rich
On Apr 4, 2005 6:50 PM, Wladimir Eremeev <wl at eimb.ru> wrote:
> Dear colleagues,
>
> I have the following code, which 'filters' a data set.
>
> It works on the data frame 'whole' with four numeric columns: a,b,d, and c.
> Every row in the data frame is considered as a point in 3-D space.
> Variables a,b, and d are the point's coordinates, and c is its value.
> This code looks at every point, builds a cube 'centered' at this
> point, selects the set of points inside this cube,
> calculates mean and SD of their values,
> and drops points whose values differ from the mean by more than 2 SD.
>
> Here is the code.
>
> =======
> # initialization
> cube.half.size<-2 # half size of a cube to be built around every point
>
> mult.sigma<-2 # we will drop every point whose value differs
> # from the mean by more than mult.sigma * SD
>
> to.drop<-data.frame() # the list of points to drop.
>
> for(i in 1:length(whole$c)){ # look at every point...
>   pv<-subset(whole,abs(a-whole$a[i])<cube.half.size & # make the subset...
>              abs(b-whole$b[i])<cube.half.size &
>              abs(d-whole$d[i])<cube.half.size);
>   if(length(pv$c)>1){ # if the subset holds more than the point itself, then
>     mean.c<-mean(pv$c) # calculate mean and SD
>     sd.c<-sd(pv$c)
>
>     # make a list of points to drop from the current subset
>     td<-subset(pv,abs(c-mean.c)>sd.c*mult.sigma)
>     if(length(td$c)>0){
>
>       # check which of these points are already in the list to drop
>       td.index<-which(row.names(td) %in% row.names(to.drop))
>
>       # and extend the list of points to drop
>       to.drop<-rbind(to.drop,if(length(td.index)>0) td[-td.index,] else td)
>
>       # print out a message showing we're alive (these messages will
>       # not appear regularly; that's OK)
>       if(length(td.index)!=length(td$c))
>         print(c("i=",i,"Points to drop: ",length(to.drop$c)))
>     }
>   }
> }
>
> # make a new data set without dropped points.
> whole.flt.3<-whole[-which(row.names(to.drop) %in% row.names(whole)),]
> =======
>
> The problem is that the 'whole' data set is large (more than 100000
> rows), and the script runs for several hours.
> The running time gets even longer if I build a sphere instead of a
> cube.
>
> I would like to optimize the code so that it runs faster.
> Is that possible?
> Would sorting the data first help?
> Thank you for your attention and any feedback.
>
> --
> Best regards
> Wladimir Eremeev mailto:wl at eimb.ru
>
> ==========================================================================
> Research Scientist, PhD
> Russian Academy of Sciences
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
>
--
Rich FitzJohn
rich.fitzjohn <at> gmail.com | http://homepages.paradise.net.nz/richa183
You are in a maze of twisty little functions, all alike