[R] speeding up a pairwise correlation calculation

Liaw, Andy andy_liaw at merck.com
Fri Nov 21 15:26:11 CET 2003


My guess is that the objective is to delete correlated variables before
doing some sort of modeling...

This is what I would do (untested):

rcut <- sqrt(r2cut)
cormat <- cor(data[, 2:ncol(data)])
## get the position of entries larger than the cutoff
bad.idx <- which(abs(cormat) > rcut, arr.ind=TRUE)
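## keep each pair once: row < column is the upper triangle, which
## also excludes the diagonal
bad.idx <- bad.idx[bad.idx[,1] < bad.idx[,2], , drop=FALSE]
## randomly pick one or the other:
drop.idx <- ifelse(runif(nrow(bad.idx)) > .5,
                   bad.idx[,1], bad.idx[,2])
## indices refer to cormat; add 1 to get column numbers in data
drop.idx <- unique(drop.idx) + 1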
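
Note that flagging every pair at once can drop more columns than the loop in the original post, which skips any pair involving a column that has already been dropped. A minimal sketch of that greedy rule run on a precomputed correlation matrix follows; the function name drop_correlated and the simulated data are assumptions made for illustration only.

```r
## Hypothetical helper (name assumed): greedy pairwise filter that
## looks up R^2 in a precomputed matrix instead of calling cor() per pair.
drop_correlated <- function(x, r2cut) {
  r2 <- cor(x)^2                     # all pairwise R^2 in one call
  p <- ncol(r2)
  dropped <- logical(p)              # TRUE once a column is marked
  if (p < 2) return(integer(0))
  for (i in seq_len(p - 1)) {
    if (dropped[i]) next
    for (j in (i + 1):p) {
      if (dropped[i]) break          # i was dropped earlier in this row
      if (dropped[j]) next
      if (r2[i, j] >= r2cut) {
        ## equal chance of dropping either member of the pair
        if (runif(1) < 0.5) dropped[i] <- TRUE else dropped[j] <- TRUE
      }
    }
  }
  which(dropped)
}

## Simulated data: column 6 is a near copy of column 1.
set.seed(1)
x <- matrix(rnorm(211 * 5), ncol = 5)
x <- cbind(x, x[, 1] + rnorm(211, sd = 0.01))
drop.idx <- drop_correlated(x, r2cut = 0.9)
```

The returned indices refer to the columns of x. If x is data[, 2:ncol(data)], add 1 to get column numbers in data.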

HTH,
Andy

> From: Adaikalavan RAMASAMY [mailto:ramasamya at gis.a-star.edu.sg] 
> 
> You probably want to use runif() instead of rnorm() for equal 
> probability of selecting between i,j
> 
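
To see why this matters, the two probabilities can be compared directly. This is only an illustrative check; the variable names are made up here.

```r
## P(|Z| < 0.5) for a standard normal Z, i.e. the chance that
## abs(rnorm(1)) < 0.5 is TRUE (roughly 0.38, not 0.5)
p_norm <- pnorm(0.5) - pnorm(-0.5)
## P(U < 0.5) for U uniform on (0, 1), i.e. the chance that
## runif(1) < 0.5 is TRUE
p_unif <- punif(0.5)
```

So with abs(rnorm(1)) < 0.5 the first column of a pair is dropped less often than the second.
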
> Your algorithm is of order n^2 [ 294 choose 2, 293 choose 2, 
> ... ], so it should not be too slow. But two for() loops are 
> inefficient in R. Something like this should be fairly fast in C.
> 
> What is your aim in trying to do this? Your algorithm is similar to
> hclust() - which has nice graphical support - but hclust() merges the
> two nearest neighbours to find another centroid instead of 
> removing one of the neighbours. By removing columns at an early 
> stage you are losing information. 
> 
> The alternative would be to use hclust(), select a 
> similarity/dissimilarity cutoff to create groups. Then from 
> each group you can either choose the average profile or 
> randomly select one column to represent the group.
> 
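
A rough sketch of the hclust() alternative, under some assumptions not stated in the thread: 1 - R^2 is used as the dissimilarity, complete linkage is used so that every pair inside a group meets the cutoff, and the data and the 0.9 cutoff are simulated for illustration.

```r
## Simulated data: column 5 is a near copy of column 1.
set.seed(2)
x <- matrix(rnorm(211 * 4), ncol = 4)
x <- cbind(x, x[, 1] + rnorm(211, sd = 0.01))
r2cut <- 0.9                          # assumed cutoff

d <- as.dist(1 - cor(x)^2)            # dissimilarity: 1 - R^2
hc <- hclust(d, method = "complete")
## with complete linkage, cutting at 1 - r2cut gives groups in which
## every pair of columns has R^2 >= r2cut
grp <- cutree(hc, h = 1 - r2cut)
## keep one randomly chosen column per group
keep.idx <- sort(unname(sapply(split(seq_along(grp), grp),
                               function(g) g[sample.int(length(g), 1)])))
```

plot(hc) shows the dendrogram if you want to choose the cutoff by eye. Averaging the columns within each group instead of sampling one is the other option mentioned above.
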
> --
> Adaikalavan Ramasamy 
> 
> 
> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch 
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Rajarshi Guha
> Sent: Friday, November 21, 2003 11:23 AM
> To: R
> Subject: [R] speeding up a pairwise correlation calculation
> 
> 
> Hi,
>   I have a data.frame with 294 columns and 211 rows. I am 
> calculating correlations between all pairs of columns 
> (excluding column 1) and based on these correlation values I 
> delete one column from any pair that shows an R^2 greater than 
> a cutoff value. (Rather than directly delete the column, all 
> I do is store the column number and do the deletion later.)
> 
> The code I am using is:
> 
>     ndesc <- length(names(data));
>     for (i in 2:(ndesc-1)) {
>         for (j in (i+1):ndesc) {
> 
>             if (i %in% drop || j %in% drop) next;
>             
>             r2 <- cor(data[,i],data[,j]);
>             r2 <- r2*r2;
> 
>             if (r2 >= r2cut) {
>                 rnd <- abs(rnorm(1));
>                 if (rnd < 0.5) { drop <- c(drop,i); }
>                 else { drop <- c(drop,j); }
>             }
>         }
>     }
> 
> drop is a vector that contains the column numbers that can be 
> skipped; data is the data.frame.
> 
> For the data.frame mentioned above (279 columns, 211 rows) 
> the calculation takes more than 7 minutes (after which I 
> Ctrl-C'ed the calculation). The machine is a 1GHz Duron with 1GB RAM
> 
> The output of version is:
> 
> platform i686-pc-linux-gnu
> arch     i686
> os       linux-gnu
> system   i686, linux-gnu
> status
> major    1
> minor    7.1
> year     2003
> month    06
> day      16
> language R
> 
> I'm not too sure why it takes *so* long (I had done a similar 
> calculation in Python using list operations and it took 
> forever), but is there any trick that could be used to make 
> this run faster or is this type of runtime to be expected?
> 
> Thanks,
> -------------------------------------------------------------------
> Rajarshi Guha <rxg218 at psu.edu> <http://jijo.cjb.net>
> GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
> -------------------------------------------------------------------
> A red sign on the door of a physics professor: 
> 'If this sign is blue, you're going too fast.'
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list 
> https://www.stat.math.ethz.ch/mailman/listinfo/r-help
> 