[R] speeding up a pairwise correlation calculation
Spencer Graves
spencer.graves at pdf.com
Fri Nov 21 04:55:33 CET 2003
Have you tried computing the full correlation matrix with "cor" and
then selecting the variables to retain or drop from that matrix? A
single call to "cor" does the arithmetic for all pairs of columns in
compiled, vectorized code. By comparison, "for" loops are quite
inefficient: each iteration carries extra overhead for memory
management and validity checking.
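To illustrate the correlation-matrix approach, here is a minimal sketch. The function name "drop_correlated" and its arguments are hypothetical, and it assumes a simplified rule (always drop the later column of an offending pair) rather than the random choice in the original code.

```r
# Sketch only: compute every pairwise correlation once, then scan the matrix.
# 'drop_correlated', 'r2cut' and 'skip' are illustrative names, not from the post.
drop_correlated <- function(data, r2cut, skip = 1) {
  cols <- setdiff(seq_len(ncol(data)), skip)  # column 1 is excluded, as in the question
  n <- length(cols)
  if (n < 2) return(integer(0))
  r2 <- cor(data[, cols])^2                   # one vectorized call covers all pairs
  keep <- rep(TRUE, n)
  for (i in seq_len(n - 1)) {
    if (!keep[i]) next                        # column already marked for deletion
    j <- (i + 1):n
    # which() discards NA comparisons (e.g. from constant columns)
    hit <- j[which(keep[j] & r2[i, j] >= r2cut)]
    keep[hit] <- FALSE                        # assumed rule: drop the later column
  }
  cols[!keep]                                 # column numbers in the original data frame
}
```

The remaining loop only reads values from a matrix that has already been computed, so "cor" is called once instead of once per pair.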
Alternatively, have you considered vectorizing the inner loop,
something like the following:
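    ndesc <- dim(data)[2]
    Keep <- rep(TRUE, ndesc)
    for(i in 2:(ndesc-1)){
      # K.i flags which of the later columns are still candidates
      if(any(K.i <- Keep[(i+1):ndesc])){
        # one call to cor() replaces the inner loop over j
        cor.i <- cor(data[,i], data[,((i+1):ndesc)[K.i]])
        # ...<your selection criteria applied to Keep>
      }
    }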
Obviously, I haven't tested this specific code, but I hope it is
adequate to illustrate the technique. It might even be faster than
either of the other options discussed.
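For concreteness, the vectorized inner loop might be completed along these lines. This is only a sketch: the function name "flag_columns", the argument "r2cut", and the rule of dropping the later column of each pair are assumptions filled in for illustration, not part of the code above.

```r
# Sketch only: the outer loop stays, the inner loop becomes one cor() call.
# 'flag_columns' and the drop-the-later-column rule are illustrative assumptions.
flag_columns <- function(data, r2cut) {
  ndesc <- ncol(data)
  Keep <- rep(TRUE, ndesc)
  if (ndesc < 3) return(integer(0))           # need at least two columns after column 1
  for (i in 2:(ndesc - 1)) {
    if (!Keep[i]) next                        # column i was already dropped
    later <- (i + 1):ndesc
    cand <- later[Keep[later]]                # later columns still in play
    if (length(cand) > 0) {
      # correlate column i against all remaining candidates at once
      cor.i <- cor(data[, i], data[, cand, drop = FALSE])
      Keep[cand[which(cor.i^2 >= r2cut)]] <- FALSE
    }
  }
  which(!Keep)                                # column numbers to delete later
}
```

Because dropped columns are removed from the candidate set as the loop proceeds, correlations against columns that are already excluded are never computed.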
hope this helps. spencer graves
Rajarshi Guha wrote:
>Hi,
> I have a data.frame with 294 columns and 211 rows. I am calculating
>correlations between all pairs of columns (excluding column 1) and based
>on these correlation values I delete one column from any pair that shows
>an R^2 greater than a cutoff value. (Rather than directly deleting the
>column, all I do is store the column number and do the deletion later.)
>
>The code I am using is:
>
> ndesc <- length(names(data));
> for (i in 2:(ndesc-1)) {
> for (j in (i+1):ndesc) {
>
> if (i %in% drop || j %in% drop) next;
>
> r2 <- cor(data[,i],data[,j]);
> r2 <- r2*r2;
>
> if (r2 >= r2cut) {
> rnd <- abs(rnorm(1));
> if (rnd < 0.5) { drop <- c(drop,i); }
> else { drop <- c(drop,j); }
> }
> }
> }
>
>drop is a vector that contains column numbers that can be skipped
>data is the data.frame
>
>For the data.frame mentioned above (279 columns, 211 rows) the
>calculation takes more than 7 minutes (after which I Ctrl-C'ed the
>calculation). The machine is a 1GHz Duron with 1GB RAM
>
>The output of version is:
>
>platform i686-pc-linux-gnu
>arch i686
>os linux-gnu
>system i686, linux-gnu
>status
>major 1
>minor 7.1
>year 2003
>month 06
>day 16
>language R
>
>I'm not too sure why it takes *so* long (I had done a similar
>calculation in Python using list operations and it took forever), but is
>there any trick that could be used to make this run faster or is this
>type of runtime to be expected?
>
>Thanks,
>-------------------------------------------------------------------
>Rajarshi Guha <rxg218 at psu.edu> <http://jijo.cjb.net>
>GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
>-------------------------------------------------------------------
>A red sign on the door of a physics professor:
>'If this sign is blue, you're going too fast.'
>