[R] speeding up a pairwise correlation calculation

Spencer Graves spencer.graves at pdf.com
Fri Nov 21 04:55:33 CET 2003


      Have you tried computing the correlation matrix using "cor" and 
then selecting variables to retain or drop from the resulting 
correlation matrix?  R uses vectorized arithmetic for operations like 
"cor".  By comparison, "for" loops are quite inefficient, requiring 
extra overhead for memory management and validity checking. 
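For instance, here is a minimal sketch of the whole-matrix idea. The toy `data` and `r2cut` merely stand in for the poster's objects, and the selection rule here keeps the earlier column of each highly correlated pair rather than choosing at random as the original code does:

```r
## Sketch of the whole-matrix approach (toy data stands in for the
## poster's 294-column data.frame; selection rule is deterministic).
set.seed(42)
x <- rnorm(50)
data <- data.frame(id = 1:50, x = x, y = x + rnorm(50, sd = 0.01),
                   z = rnorm(50))
r2cut <- 0.9

r2 <- cor(data[, -1])^2        # all pairwise squared correlations in one call
diag(r2) <- 0                  # ignore self-correlations
drop <- integer(0)
for (i in seq_len(ncol(r2))) {
  if (i %in% drop) next        # column already marked; skip it
  hits <- which(r2[i, ] >= r2cut)
  drop <- union(drop, hits[hits > i])
}
keep <- setdiff(seq_len(ncol(r2)), drop) + 1   # map back to data.frame columns
```

The single `cor` call on the whole matrix replaces all of the pairwise calls in the double loop; the remaining loop only scans rows of the precomputed matrix.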

      Alternatively, have you considered vectorizing the inner loop, 
something like the following: 

      ndesc <- dim(data)[2]
      Keep <- rep(TRUE, ndesc)
      for(i in 2:(ndesc-1)){
            if(any(K.i <- Keep[(i+1):ndesc])){
                  cor.i <- cor(data[,i], data[,((i+1):ndesc)[K.i]])
                  ...<your selection criteria applied to Keep>
            }
      }

      Obviously, I haven't tested this specific code, but I hope it is 
adequate to illustrate the technique.  It might even be faster than 
either of the other options discussed. 
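For completeness, here is one runnable completion of that sketch. The selection criterion (drop the later column of a pair whose r^2 exceeds the cutoff) is my assumption, since the original post broke ties at random, and the toy `data` and `r2cut` stand in for the poster's objects:

```r
## Runnable completion of the inner-loop sketch; the "drop the later
## column" rule is an assumption (the original chose at random).
set.seed(42)
x <- rnorm(50)
data <- data.frame(id = 1:50, x = x, y = x + rnorm(50, sd = 0.01),
                   z = rnorm(50))
r2cut <- 0.9

ndesc <- dim(data)[2]
Keep  <- rep(TRUE, ndesc)
for (i in 2:(ndesc - 1)) {
  if (!Keep[i]) next                         # column i already dropped
  if (any(K.i <- Keep[(i + 1):ndesc])) {
    later <- ((i + 1):ndesc)[K.i]
    ## one vectorized cor() call replaces the entire inner j loop
    cor.i <- cor(data[, i], data[, later])
    Keep[later[cor.i^2 >= r2cut]] <- FALSE   # drop the later column of each pair
  }
}
```

This does one `cor` call per retained column instead of one per pair, which cuts the number of calls from O(ndesc^2) to O(ndesc).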

      hope this helps.  spencer graves

Rajarshi Guha wrote:

>Hi,
>  I have a data.frame with 294 columns and 211 rows. I am calculating
>correlations between all pairs of columns (excluding column 1) and based
>on these correlation values I delete one column from any pair that shows
>an R^2 greater than a cutoff value. (Rather than directly delete the
>column all I do is store the column number, and do the deletion later)
>
>The code I am using is:
>
>    ndesc <- length(names(data));
>    for (i in 2:(ndesc-1)) {
>        for (j in (i+1):ndesc) {
>
>            if (i %in% drop || j %in% drop) next;
>            
>            r2 <- cor(data[,i],data[,j]);
>            r2 <- r2*r2;
>
>            if (r2 >= r2cut) {
>                rnd <- abs(rnorm(1));
>                if (rnd < 0.5) { drop <- c(drop,i); }
>                else { drop <- c(drop,j); }
>            }
>        }
>    }
>
>drop is a vector that contains column numbers that can be skipped
>data is the data.frame
>
>For the data.frame mentioned above (294 columns, 211 rows) the
>calculation takes more than 7 minutes (after which I Ctrl-C'ed the
>calculation). The machine is a 1GHz Duron with 1GB RAM
>
>The output of version is:
>
>platform i686-pc-linux-gnu
>arch     i686
>os       linux-gnu
>system   i686, linux-gnu
>status
>major    1
>minor    7.1
>year     2003
>month    06
>day      16
>language R
>
>I'm not too sure why it takes *so* long (I had done a similar
>calculation in Python using list operations and it took forever), but is
>there any trick that could be used to make this run faster or is this
>type of runtime to be expected?
>
>Thanks,
>-------------------------------------------------------------------
>Rajarshi Guha <rxg218 at psu.edu> <http://jijo.cjb.net>
>GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
>-------------------------------------------------------------------
>A red sign on the door of a physics professor: 
>'If this sign is blue, you're going too fast.'
>
>______________________________________________
>R-help at stat.math.ethz.ch mailing list
>https://www.stat.math.ethz.ch/mailman/listinfo/r-help
>  
>



