[R] speeding up a pairwise correlation calculation
Liaw, Andy
andy_liaw at merck.com
Fri Nov 21 15:26:11 CET 2003
My guess is that the objective is to delete correlated variables before
doing some sort of modeling...
This is what I would do (untested):
rcut <- sqrt(r2cut)
cormat <- cor(data[, 2:ncol(data)])
## get the position of entries larger than the cutoff
bad.idx <- which(abs(cormat) > rcut, arr.ind=TRUE)
## keep each pair only once (row index < column index)
bad.idx <- bad.idx[bad.idx[,1] < bad.idx[,2], , drop=FALSE]
## randomly pick one member of each pair to drop:
drop.idx <- ifelse(runif(nrow(bad.idx)) > .5,
                   bad.idx[,1], bad.idx[,2])
HTH,
Andy
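To actually remove the flagged columns, a small follow-on sketch (untested;
drop.idx comes from the snippet above, and the +1 accounts for column 1
being left out of cormat):

  ## drop.idx indexes cormat, i.e. data columns 2:ncol(data),
  ## so shift by one before dropping from the original data frame
  keep <- setdiff(1:ncol(data), unique(drop.idx) + 1)
  reduced <- data[, keep]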
> From: Adaikalavan RAMASAMY [mailto:ramasamya at gis.a-star.edu.sg]
>
> You probably want to use runif() instead of rnorm() to get an equal
> probability of selecting between i and j.
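As an aside, a quick illustration of the difference: abs(rnorm(1)) < 0.5 is
true only about 38% of the time, so the original rule is biased toward
dropping j, whereas runif(1) < 0.5 is a fair coin flip. For instance:

  mean(abs(rnorm(100000)) < 0.5)   # roughly 0.38
  mean(runif(100000) < 0.5)        # roughly 0.50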
>
> Your algorithm is of order n^2 [ 294 choose 2, 293 choose 2,
> ... ], so it should not be too slow. But two for() loops are
> inefficient in R. Something like this should be fairly fast in C.
>
> What is your aim in trying to do this? Your algorithm is similar to
> hclust() - which has nice graphical support - but it merges the
> two nearest neighbours into a new centroid instead of
> removing one of the neighbours. By removing columns at an early
> stage you are losing information.
>
> The alternative would be to use hclust(), select a
> similarity/dissimilarity cutoff to create groups. Then from
> each group you can either choose the average profile or
> randomly select one column to represent the group.
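A rough sketch of that alternative (untested, with rcut = sqrt(r2cut) as
above; complete linkage and taking the first column of each group as its
representative are illustrative choices only):

  ## dissimilarity = 1 - |correlation|, so highly correlated columns are close
  d <- as.dist(1 - abs(cor(data[, -1])))
  hc <- hclust(d, method = "complete")
  ## cutting at height 1 - rcut: with complete linkage every pair of columns
  ## within a group then has |r| >= rcut
  grp <- cutree(hc, h = 1 - rcut)
  ## keep the first column of each group; +1 restores the offset from
  ## excluding column 1 above
  keep <- which(!duplicated(grp))
  reduced <- data[, c(1, keep + 1)]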
>
> --
> Adaikalavan Ramasamy
>
>
> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Rajarshi Guha
> Sent: Friday, November 21, 2003 11:23 AM
> To: R
> Subject: [R] speeding up a pairwise correlation calculation
>
>
> Hi,
> I have a data.frame with 294 columns and 211 rows. I am
> calculating correlations between all pairs of columns
> (excluding column 1), and based on these correlation values I
> delete one column from any pair that shows an R^2 greater than
> a cutoff value. (Rather than directly delete the column, all
> I do is store the column number and do the deletion later.)
>
> The code I am using is:
>
> ndesc <- length(names(data));
> for (i in 2:(ndesc-1)) {
>   for (j in (i+1):ndesc) {
>
>     if (i %in% drop || j %in% drop) next;
>
>     r2 <- cor(data[,i], data[,j]);
>     r2 <- r2*r2;
>
>     if (r2 >= r2cut) {
>       rnd <- abs(rnorm(1));
>       if (rnd < 0.5) { drop <- c(drop,i); }
>       else { drop <- c(drop,j); }
>     }
>   }
> }
>
> drop is a vector that contains the column numbers that can be
> skipped; data is the data.frame.
>
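Much of this time is likely spent in the interpreted double loop and in the
up-to-43,000 separate cor() calls. An intermediate fix, short of fully
vectorising as in the reply at the top of this thread, is to compute the
whole correlation matrix once and only index into it inside the loop; an
untested sketch:

  r2mat <- cor(data[, -1])^2        # one call instead of one per pair
  drop <- c()
  for (i in 1:(ncol(r2mat) - 1)) {
    for (j in (i+1):ncol(r2mat)) {
      if (i %in% drop || j %in% drop) next
      if (r2mat[i, j] >= r2cut)
        drop <- c(drop, if (runif(1) < 0.5) i else j)
    }
  }
  drop <- drop + 1                  # indices refer to columns 2:ncol(data)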
> For the data.frame mentioned above (294 columns, 211 rows),
> the calculation takes more than 7 minutes (after which I
> Ctrl-C'ed the calculation). The machine is a 1GHz Duron with 1GB of RAM.
>
> The output of version is:
>
> platform i686-pc-linux-gnu
> arch i686
> os linux-gnu
> system i686, linux-gnu
> status
> major 1
> minor 7.1
> year 2003
> month 06
> day 16
> language R
>
> I'm not too sure why it takes *so* long (I had done a similar
> calculation in Python using list operations and it took
> forever), but is there any trick that could be used to make
> this run faster or is this type of runtime to be expected?
>
> Thanks,
> -------------------------------------------------------------------
> Rajarshi Guha <rxg218 at psu.edu> <http://jijo.cjb.net>
> GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
> -------------------------------------------------------------------
> A red sign on the door of a physics professor:
> 'If this sign is blue, you're going too fast.'
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://www.stat.math.ethz.ch/mailman/listinfo/r-help
>