[R] speeding up a pairwise correlation calculation
Rajarshi Guha
rxg218 at psu.edu
Fri Nov 21 04:23:27 CET 2003
Hi,
I have a data.frame with 294 columns and 211 rows. I am calculating
correlations between all pairs of columns (excluding column 1), and
based on these correlation values I delete one column from any pair
that shows an R^2 greater than a cutoff value. (Rather than deleting
the column directly, I just store its number and do the deletion later.)
The code I am using is:
ndesc <- length(names(data))
drop  <- c()                  # column numbers marked for later removal
for (i in 2:(ndesc - 1)) {
    for (j in (i + 1):ndesc) {
        ## skip pairs where one member is already marked for removal
        if (i %in% drop || j %in% drop) next
        r2 <- cor(data[, i], data[, j])^2
        if (r2 >= r2cut) {
            ## randomly mark one column of the pair for removal
            if (abs(rnorm(1)) < 0.5) {
                drop <- c(drop, i)
            } else {
                drop <- c(drop, j)
            }
        }
    }
}
Here drop is a vector containing the column numbers that can be skipped,
and data is the data.frame.
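
A minimal sketch of an equivalent formulation, which computes the whole
correlation matrix in a single cor() call and only then scans it for
pairs to drop, might look like this (it assumes all columns of data are
numeric and reuses r2cut and the same random drop rule; r2mat and nc are
just local names introduced for the sketch):

r2mat <- cor(data[, -1])^2    # squared correlations for columns 2..ndesc
drop  <- c()
nc    <- ncol(r2mat)
for (i in 1:(nc - 1)) {
    for (j in (i + 1):nc) {
        ## row/column i of r2mat corresponds to column i + 1 of data
        if ((i + 1) %in% drop || (j + 1) %in% drop) next
        if (r2mat[i, j] >= r2cut) {
            ## randomly mark one column of the pair, as in the loop above
            if (abs(rnorm(1)) < 0.5) {
                drop <- c(drop, i + 1)
            } else {
                drop <- c(drop, j + 1)
            }
        }
    }
}

The idea is that the tens of thousands of individual cor() calls in the
nested loop are replaced by one vectorised call, so the inner loop only
does cheap matrix lookups.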
For the data.frame mentioned above (279 columns, 211 rows) the
calculation ran for more than 7 minutes before I Ctrl-C'ed it. The
machine is a 1GHz Duron with 1GB RAM.
The output of version is:
platform i686-pc-linux-gnu
arch i686
os linux-gnu
system i686, linux-gnu
status
major 1
minor 7.1
year 2003
month 06
day 16
language R
I'm not too sure why it takes *so* long (a similar calculation I had
done in Python using list operations also took forever), but is there
any trick that could be used to make this run faster, or is this type
of runtime to be expected?
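
For reference, one way to see where the time goes is to wrap both
approaches in system.time() on simulated data of the same shape (the
fake data.frame below is only a stand-in with the dimensions from the
post):

set.seed(1)
fake <- as.data.frame(matrix(rnorm(211 * 294), nrow = 211))

## a single cor() call over all columns except the first
system.time(cor(fake[, -1]))

## tens of thousands of separate cor() calls, as in the nested loop
system.time(
    for (i in 2:(ncol(fake) - 1))
        for (j in (i + 1):ncol(fake))
            cor(fake[, i], fake[, j])
)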
Thanks,
-------------------------------------------------------------------
Rajarshi Guha <rxg218 at psu.edu> <http://jijo.cjb.net>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
A red sign on the door of a physics professor:
'If this sign is blue, you're going too fast.'