[R] Parallelizing cor() for large data-set using Cluster

kparamas kparamas at asu.edu
Sat Jan 29 05:18:25 CET 2011


I am running my code in a cluster at Arizona State University.

I have a huge climate data set: 66000 x 500.

I am not sure whether I can compute the correlation of such a large data set on the cluster.
Normally I allocate 20000M of memory and operate on a 5 x 20000 subset,
and even that takes a lot of time. Is there any way I can compute
cl <- cor(cdata) utilizing the computers in the cluster (I am using 32 nodes)?
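Before parallelizing, a back-of-envelope check of what actually has to fit in memory may help (a sketch, assuming 8-byte doubles; n and p here just mirror the 66000 x 500 shape):

```r
## Rough sizes for the 66000 x 500 problem, assuming 8-byte doubles
n <- 66000   # rows
p <- 500     # columns
n * p * 8 / 2^20   # the input matrix: ~252 MB
p * p * 8 / 2^20   # cor() over the 500 columns: ~2 MB -- easy on one node
n * n * 8 / 2^30   # cor() over the 66000 rows: ~32 GB -- the real obstacle
```

So if cor(cdata) means correlating the 500 columns, a single node with 20000M allocated should manage it directly; the cluster only becomes necessary if the 66000 rows are the variables.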

I am using the following code to set up the cluster:

library(snow)                          # or library(parallel) in recent R
cl <- makeCluster(64)                  # one cluster object, 64 workers
clusterExport(cl, "fakeData")          # ship the data to every worker
clusterEvalQ(cl, library(boot))        # load boot on every worker
system.time(out2 <- clusterApplyLB(cl, pair, geneCor))

Here pair and geneCor are defined as:

# all pairs of row indices
pair <- combn(1:nrow(fakeData), 2, simplify = FALSE)

geneCor <- function(x, gene = fakeData) {
    # correlation between rows x[1] and x[2]
    # (the t() calls assume gene is a data.frame; for a plain matrix,
    # cor(gene[x[1], ], gene[x[2], ]) on the row vectors is enough)
    cor(t(gene[x[1], ]), t(gene[x[2], ]))
}
# This calculates the correlation pair by pair. But I want cor(data) for
# the whole 2D matrix to be computed in parallel.
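One task per pair will not scale here: 66000 rows give about 2.2 billion pairs, which will swamp any scheduler. A block-wise sketch instead, assuming the 500 columns are the variables and the data has no constant columns: after scale(), cor(X) equals crossprod(Z) / (n - 1), so each worker can compute one block of the result. blockCor and its helpers are illustrative names, not an existing API; if the 66000 rows are the variables, the same idea applies to t(fakeData), bearing in mind the 66000 x 66000 result is roughly 32 GB.

```r
library(parallel)

## Block-wise correlation matrix: one task per pair of column blocks.
blockCor <- function(X, cl, block = 100) {
  Z <- scale(X)          # column-standardize, so cor(X) == crossprod(Z)/(n-1)
  p <- ncol(Z)
  starts <- seq(1, p, by = block)
  blocks <- lapply(starts, function(s) s:min(s + block - 1, p))

  ## ship the standardized matrix to the workers once, not per task
  clusterExport(cl, "Z", envir = environment())

  ## one task per upper-triangular pair of column blocks
  tasks <- list()
  for (i in seq_along(blocks))
    for (j in i:length(blocks))
      tasks[[length(tasks) + 1]] <- list(bi = blocks[[i]], bj = blocks[[j]])

  worker <- function(b)
    crossprod(Z[, b$bi, drop = FALSE], Z[, b$bj, drop = FALSE]) / (nrow(Z) - 1)
  environment(worker) <- globalenv()   # workers look up the exported Z

  pieces <- parLapply(cl, tasks, worker)

  ## reassemble the blocks into the full symmetric matrix
  out <- matrix(NA_real_, p, p)
  for (k in seq_along(tasks)) {
    bi <- tasks[[k]]$bi
    bj <- tasks[[k]]$bj
    out[bi, bj] <- pieces[[k]]
    out[bj, bi] <- t(pieces[[k]])
  }
  out
}

## small demonstration that the blocks reassemble to cor(X)
X <- matrix(rnorm(200 * 10), 200, 10)
cl_demo <- makeCluster(2)
all.equal(blockCor(X, cl_demo, block = 3), cor(X), check.attributes = FALSE)
stopCluster(cl_demo)
```

On the real data one would use the existing 64-worker cluster and a larger block size (say 128), and call stopCluster(cl) when done; the per-task payload is then a block of the 500 x 500 result rather than a single scalar.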

