[R] How to optimize this codes ?

William Dunlap wdunlap at tibco.com
Thu Dec 4 20:49:30 CET 2008


> -----Original Message-----
> From: William Dunlap 
> Sent: Thursday, December 04, 2008 9:59 AM
> To: 'daren76 at hotmail.com'
> Cc: 'R help'
> Subject: Re: [R] How to optimize this codes ?
> 
>    [R] How to optimize this codes ?
>    Daren Tan daren76 at hotmail.com
>    Thu Dec 4 17:02:49 CET 2008
> 
>    How can I optimize the for-loop below to be reasonably fast for
>    sample.size=100000000?  You may want to try sample.size=1000 first
>    to get an idea of what I am trying to achieve.
> 
>    set.seed(143)
>    A <- matrix(sample(0:1, sample.size, TRUE), ncol=10,
>                dimnames=list(NULL, LETTERS[1:10]))
> 
>    B <- list()
>    for(i in 1:10) {
>      B[[i]] <- apply(combn(LETTERS[1:10], i), 2,
>                      function(x) { sum(apply(data.frame(A[,x]), 1, all)) })
>    }
> 
> Instead of computing all(A[row,x]) for each row of the matrix by looping
> over the rows, you could loop over the columns.  That is generally
> quicker when there are many more rows than columns.  S+ and R have
> functions called pmin and pmax that do this for min and max, but
> no pany or pall functions.  In your 0/1 case all is the same as min,
> so you can replace
>      sum(apply(data.frame(A[,x]), 1, all))
> with
>      sum(do.call("pmin", unname(A[,x,drop=FALSE])))
> after first converting A to a data.frame before starting to compute B.
> (The do.call/unname combo is a hack to pass all the columns of a
> data.frame as separate arguments to a function.  If pmin took a list,
> that wouldn't be necessary.)
> 
> When you do timings, looking at one size of problem often isn't helpful,
> as it doesn't tell you how the time depends on the size of the input.
> The relationship often is not linear.  E.g., I put your method into a
> function called computeB0 and my modification of it into computeB1, and
> wrote a makeA that makes an A matrix with a given sample.size.  Here are
> the times; it looks like 1000 is below the linear range for this example:
> 
> sample.size       time:computeB0   time:computeB1
>        1000                 5.36             0.98
>       10000                36.82             2.49
>      100000               381.34            18.97
> 
> and here are the functions used
> 
> makeA <- function(sample.size) {
>   set.seed(143)
>   A <- matrix(sample(0:1, size=sample.size, replace=TRUE), ncol=10,
>               dimnames=list(NULL, LETTERS[1:10]))
>   A
> }
> computeB0 <- function(A) {
>   B <- list()
>   for(i in 1:10) {
>     B[[i]] <- apply(combn(LETTERS[1:10], i), 2,
>                     function(x) { sum(apply(data.frame(A[,x]), 1, all)) })
>   }
>   B
> }
> computeB1 <- function(A) {
>   A <- as.data.frame(A)   # a data.frame lets do.call pass columns as separate args to pmin
>   B <- list()
>   for(i in 1:10) {
>     B[[i]] <- apply(combn(LETTERS[1:10], i), 2,
>                     FUN=function(x) {
>                       sum(do.call("pmin", unname(A[,x,drop=FALSE])))
>                     })
>   }
>   B
> }
> 
> You could probably save more time by restructuring this as a recursive
> function, still operating on columns, that traverses the binary tree of
> column inclusion/exclusion.  I just wanted to point out the advantage of
> looping over columns instead of looping over rows.
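
As a quick check of the pmin/all replacement described above, here is a
small sketch (the toy matrix and the column subset x are just for
illustration):

   set.seed(1)
   A  <- matrix(sample(0:1, 50, replace=TRUE), ncol=10,
                dimnames=list(NULL, LETTERS[1:10]))
   Ad <- as.data.frame(A)   # so do.call can pass the columns to pmin
   x  <- c("A", "C", "D")   # an arbitrary subset of columns
   # row-wise: count rows in which every selected column is 1
   n.rowwise <- sum(apply(data.frame(A[,x]), 1, all))
   # column-wise: parallel minimum across the selected columns
   n.colwise <- sum(do.call("pmin", unname(Ad[, x, drop=FALSE])))
   n.rowwise == n.colwise   # TRUE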

A hastily written recursive version is:

computeB3 <- function(A) {
   B <- rep(list(integer(0)), ncol(A))
   storage.mode(A) <- "logical"
   recurse <- function(thisCol = 1, includedCols = rep(NA, ncol(A)),
                       allSoFar = rep(TRUE, nrow(A))) {
      if (thisCol < ncol(A)) # first explore the subtree that skips this column
          Recall(thisCol+1, replace(includedCols, thisCol, FALSE),
                 allSoFar = allSoFar)
      else if (thisCol > ncol(A))
          return()
      # then include this column and record the count for its column subset
      includedCols[thisCol] <- TRUE
      allSoFar <- allSoFar & A[,thisCol]
      nIncludedCols <- sum(includedCols[1:thisCol])
      # cat(thisCol, ":", nIncludedCols, ":", includedCols[1:thisCol],
      #     sum(allSoFar), "\n")
      # Note: B <<- in the next line uses lexical scoping to change
      # computeB3's B.  Does not work in S+.
      B[[nIncludedCols]][length(B[[nIncludedCols]])+1] <<- sum(allSoFar)
      Recall(thisCol+1, includedCols, allSoFar = allSoFar)
   }
   recurse()
   B
}

The elements of B[[i]], for each i, are in a different order than they are
in the other functions, but the histograms are the same.
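
A quick way to check that, up to ordering, computeB1 and computeB3 agree
(a sketch, assuming the makeA/computeB1/computeB3 definitions above):

   A  <- makeA(1000)
   B1 <- computeB1(A)
   B3 <- computeB3(A)
   # sort each B[[i]] before comparing, since only the order differs
   all(mapply(function(b1, b3) all(sort(b1) == sort(b3)), B1, B3))   # TRUE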

I added computeB3's time to the timing table from above:

sample.size       time:computeB0   time:computeB1    time:computeB3
       1000                 5.36             0.98              0.12    
      10000                36.82             2.49              0.25
     100000               381.34            18.97              1.62
    1000000                    ?           290.21             16.36
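
For reference, timings like those above can be collected with, e.g.,
system.time; a minimal sketch (the numbers will of course vary with the
machine):

   for (n in c(1000, 10000, 100000)) {
      A <- makeA(n)
      cat(n, "computeB1:", system.time(computeB1(A))["elapsed"],
             "computeB3:", system.time(computeB3(A))["elapsed"], "\n")
   }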

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com 


