[BioC] Computing large correlations in R
Paul [guest]
guest at bioconductor.org
Mon Sep 30 08:57:07 CEST 2013
I have two lists, A and B. Each contains 100 data frames, and each data frame has dimension 15000 x 15000. I would like to compute one correlation coefficient per pair of corresponding data frames: take the first data frame from each list and compute cor(A, B) over all of their entries to get a single value, then do the same for the second pair, and so on for all 100 pairs.
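To make the target quantity concrete, here is a minimal sketch on toy data. The names A_small, B_small, make_df and flat_cor are made up for illustration, and the 4 x 4 size is only a stand-in for the real 15000 x 15000 data frames; flattening this way is not expected to scale to the full data.

```r
# Toy stand-ins for the real lists (hypothetical: 3 pairs of 4 x 4 data frames)
set.seed(1)
make_df <- function() as.data.frame(matrix(rnorm(16), nrow = 4))
A_small <- replicate(3, make_df(), simplify = FALSE)
B_small <- replicate(3, make_df(), simplify = FALSE)

# Flatten each data frame to one numeric vector and correlate the pair
flat_cor <- function(a, b) {
  cor(as.vector(as.matrix(a)), as.vector(as.matrix(b)))
}

# One coefficient per pair of corresponding data frames
pair_cors <- mapply(flat_cor, A_small, B_small)
print(pair_cors)
```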
I tried the following:
A                 # list of 100 data frames
B                 # list of 100 data frames
C <- A[1]         # extract only the first element of A
D <- B[1]         # extract only the first element of B
C <- unlist(C)    # flatten C into a single vector
D <- unlist(D)    # flatten D into a single vector
Then I computed:
Correlation <- cor(C, D)    # a single coefficient showing how these two vectors are correlated
But I end up with an error saying:
R cannot allocate a vector of size 3.9 GB
Is there a better and faster way to do this that could be applied to the entire list? I work on a server that lets me run large computations, but I still get this error, and the unlisting takes ages because of the size of the data frames.
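One possible direction, sketched under assumptions rather than tested on the real data: accumulate the sums needed for the Pearson coefficient one column at a time, so that no full-length flattened copy of a data frame is ever created. The function name stream_cor is made up here, and the sketch assumes that all columns are numeric, that there are no missing values, and that both data frames have the same dimensions. The raw-sum formula can lose precision when the means are large relative to the spread of the values.

```r
# Pearson correlation over all entries of two equally sized data frames,
# accumulated column by column (hypothetical helper, not from any package).
stream_cor <- function(a, b) {
  stopifnot(identical(dim(a), dim(b)))
  n <- 0; sx <- 0; sy <- 0; sxx <- 0; syy <- 0; sxy <- 0
  for (j in seq_len(ncol(a))) {
    x <- as.numeric(a[[j]])   # one column at a time keeps memory use small
    y <- as.numeric(b[[j]])
    n   <- n   + length(x)
    sx  <- sx  + sum(x)
    sy  <- sy  + sum(y)
    sxx <- sxx + sum(x * x)
    syy <- syy + sum(y * y)
    sxy <- sxy + sum(x * y)
  }
  cov_xy <- sxy - sx * sy / n
  var_x  <- sxx - sx^2 / n
  var_y  <- syy - sy^2 / n
  cov_xy / sqrt(var_x * var_y)
}

# Small check against cor() on flattened toy data
set.seed(42)
a <- as.data.frame(matrix(rnorm(200), nrow = 20))
b <- as.data.frame(matrix(rnorm(200), nrow = 20))
streamed  <- stream_cor(a, b)
flattened <- cor(unlist(a, use.names = FALSE), unlist(b, use.names = FALSE))
print(all.equal(streamed, flattened))
```

Applied to the lists from the question, this would be something like vapply(seq_along(A), function(i) stream_cor(A[[i]], B[[i]]), numeric(1)), which extracts each data frame with A[[i]] and never calls unlist().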
-- output of sessionInfo():
R version 3.0.1 (2013-05-16)
Platform: x86_64-redhat-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_3.0.1
--
Sent via the guest posting facility at bioconductor.org.