[R] Correlation of huge matrix saved as binary file
Peter Langfelder
peter.langfelder at gmail.com
Sat Mar 3 02:36:07 CET 2012
I don't think you can speed it up by a whole lot... but you can try a
few things, especially if you don't have missing data in the matrix
(which you probably don't). The main question is what takes most of
the time: the api calls or the cor() call? If it's cor(), here's what
you can try:
1. Pre-standardize the entire input matrix, i.e. scale each
column to mean=0 and sum of squares=1. Save the standardized matrix
(or make sure it's available to api). Since your matrix only has 9000
columns, this should not take extremely long.
2. Instead of calculating correlations, calculate simply sum(g1*g2) -
if g1 and g2 are standardized as above, correlation equals sum(g1*g2).
3. Instead of calculating the correlations one-by-one, calculate them
in small blocks (if you have enough memory and you run a 64-bit R).
With 900M rows, you will only be able to put a 900Mx2 matrix into an R
object, but if you have two such standardized matrices loaded in g1,
g2, you can get their (2x2) correlation matrix by t(g1) %*% g2. This
2x2 matrix you can use to fill the appropriate components of the
result matrix.
4. Use one of the multi-threading packages (multicore comes to mind
but there are others) to parallelize your code. If you have 8
available cores, you can expect a nearly 8x speedup.
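
Steps 1 to 3 can be sketched on a small in-memory matrix. This is only an
illustration under assumptions: the toy matrix x stands in for the columns
that would really come from api$getCol(), the sizes are made up, and
standardize() is a hypothetical helper name, not part of any package.

```r
# Toy stand-in for the real data: 1000 rows x 6 columns (assumed sizes).
set.seed(1)
x <- matrix(rnorm(1000 * 6), nrow = 1000, ncol = 6)

# Step 1: scale each column to mean 0 and sum of squares 1.
standardize <- function(m) {
  centered <- sweep(m, 2, colMeans(m))                 # subtract column means
  sweep(centered, 2, sqrt(colSums(centered^2)), "/")   # divide by column norms
}
z <- standardize(x)

# Step 2: for standardized columns, the correlation is a plain dot product.
r12 <- sum(z[, 1] * z[, 2])

# Step 3: correlate two blocks of columns at once with a cross product
# (crossprod(a, b) is t(a) %*% b) and write the result into the matching
# part of the result matrix.
cor.matrix <- matrix(NA_real_, nrow = 6, ncol = 6)
block.a <- 1:2
block.b <- 3:4
cor.matrix[block.a, block.b] <- crossprod(z[, block.a], z[, block.b])
```

Here r12 should agree with cor(x[, 1], x[, 2]) up to floating-point error,
and the filled block with cor(x[, 1:2], x[, 3:4]).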
All in all, this will probably still take forever, but should be one
or two orders of magnitude faster than your current code :)
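
A rough sketch of how the block-wise calculation could be parallelized
follows. It uses mclapply() from the parallel package, which is an assumption
on my part: parallel ships with R 2.14 and later, so on R 2.11.1 the multicore
package would be needed instead. The toy matrix, the block size of 2, and the
core count are made-up values, and in the real setting each task would fetch
its columns through the api rather than index an in-memory matrix.

```r
library(parallel)  # assumption: R >= 2.14; older versions would need multicore

# Toy standardized matrix (mean 0, sum of squares 1 per column).
set.seed(1)
x <- matrix(rnorm(500 * 8), nrow = 500, ncol = 8)
centered <- sweep(x, 2, colMeans(x))
z <- sweep(centered, 2, sqrt(colSums(centered^2)), "/")

# Split the columns into blocks of 2 and list every pair of blocks (a <= b).
blocks <- split(seq_len(ncol(z)), ceiling(seq_len(ncol(z)) / 2))
pairs <- which(upper.tri(diag(length(blocks)), diag = TRUE), arr.ind = TRUE)

# Forking is not available on Windows, so fall back to a single core there.
n.cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Each task computes one block of correlations.
results <- mclapply(seq_len(nrow(pairs)), function(k) {
  crossprod(z[, blocks[[pairs[k, 1]]]], z[, blocks[[pairs[k, 2]]]])
}, mc.cores = n.cores)

# Fill the result matrix, mirroring each block across the diagonal.
cor.matrix <- matrix(NA_real_, ncol(z), ncol(z))
for (k in seq_len(nrow(pairs))) {
  a <- blocks[[pairs[k, 1]]]
  b <- blocks[[pairs[k, 2]]]
  cor.matrix[a, b] <- results[[k]]
  cor.matrix[b, a] <- t(results[[k]])
}
```

On this toy input cor.matrix should match cor(x) up to floating-point error.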
HTH,
Peter
On Fri, Mar 2, 2012 at 2:50 PM, Bryo <brynedal at gmail.com> wrote:
> Hi,
>
> I have a 900,000,000*9,000 matrix where I need to calculate the correlation
> between all entries along the smaller dimension, thus creating a 9k*9k
> correlation matrix. This matrix is too big to be uploaded in R, and is saved
> as a binary file. To access the data in the file I use mmap and some
> api-functions (to get all values in one row, one column, or one particular
> value). I'm looking for some advice in how to calculate the correlation
> matrix. Right now my approach is to do something similar to this (toy code):
>
> cor.matrix<-matrix(NA_real_,ncol=9000,nrow=9000)
>
> for (i in 1:8999) {
> for (j in (i+1):9000) {
> # i1=... getting the index of item (i) in a second file
> # i2=....getting the index of item (j)
> g1=api$getCol(i1)
> g2=api$getCol(i2)
> cor.matrix[i,j]=cor(g1,g2)
> }}
>
> This will work, but will take forever. Any advice for how this can be done
> more efficiently? I'm running on a 2.6.18 linux system, with R version
> R-2.11.1.
>
> Thanks!