[R] Speeding up resampling of rows from a large matrix
Juan Pablo Lewinger
lewinger at usc.edu
Fri May 25 08:04:28 CEST 2007
I'm trying to:
Resample, with replacement, pairs of distinct rows from a 120 x 65,000
matrix H of 0's and 1's. For each resampled pair, sum the resulting 2
x 65,000 matrix by column:
    0 1 0 1 ...
  +
    0 0 1 1 ...
  _____________
  = 0 1 1 2 ...
For each column accumulate the number of 0's, 1's and 2's over the
resamples to obtain a 3 x 65,000 matrix G.
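
To make the target concrete, here is a toy version of the computation. The 4 x 5 matrix H_toy and the two pairs in pair_toy are made-up values chosen only for illustration (the first pair reproduces the sum shown above); they are not part of the real data.

```r
# Toy data (hypothetical): 4 haplotypes (rows) by 5 SNPs (columns)
H_toy <- rbind(c(0, 1, 0, 1, 1),
               c(0, 0, 1, 1, 0),
               c(1, 1, 0, 0, 1),
               c(0, 1, 1, 1, 0))
# Two resampled pairs of distinct rows, one pair per column
pair_toy <- cbind(c(1, 2), c(3, 4))

G_toy <- matrix(0, nrow = 3, ncol = ncol(H_toy))
for (i in seq_len(ncol(pair_toy))) {
  s <- colSums(H_toy[pair_toy[, i], ])            # column sums of the 2 x 5 pair
  G_toy <- G_toy + rbind(s == 0, s == 1, s == 2)  # add the 0/1/2 indicators
}
G_toy
# Row 1 counts 0's, row 2 counts 1's, row 3 counts 2's:
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    1    0    0    0    0
# [2,]    1    1    2    1    2
# [3,]    0    1    0    1    0
```

Each column of G_toy sums to the number of resamples (2 here).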
For those interested in the background, H is a matrix of haplotypes;
each pair of haplotypes forms a genotype, and each column corresponds
to a SNP. I'm using resampling to compute the null distribution of
the maximum over correlated SNPs of a simple statistic.
The code:
#-------------------------------------------------------------------------------
nSNPs <- 1000
# Simulated stand-in for the real H, which is 120 x 65000
H <- matrix(sample(0:1, 120 * nSNPs, replace = TRUE), nrow = 120)
G <- matrix(0, nrow = 3, ncol = nSNPs)
nResamples <- 3000
# Each column of pair holds the indices of two distinct rows of H
pair <- replicate(nResamples, sample(1:120, 2))
# Indicators of whether a column sum equals 0, 1 or 2
gen <- function(x) {
  g <- sum(x)
  c(g == 0, g == 1, g == 2)
}
for (i in 1:nResamples) {
  G <- G + apply(H[pair[, i], ], 2, gen)
}
#-------------------------------------------------------------------------------
The problem is that the loop takes about 80 minutes to complete, and I
need to repeat the whole thing 10,000 times, which would then take
over a year and a half!
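
The estimate is simple arithmetic, assuming every one of the 10,000 repetitions costs the same 80 minutes. The variable names below are only for illustration.

```r
minsPerRun <- 80       # observed time for one pass of the loop
nRuns <- 10000         # number of repetitions needed

totalDays <- minsPerRun * nRuns / 60 / 24   # about 556 days
totalYears <- totalDays / 365               # about 1.5 years

# Speed-up factor needed to finish everything in one week
speedupForOneWeek <- totalDays / 7          # roughly 79-fold
```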
Is there a way to speed this up so that the full 10,000 iterations
take a reasonable amount of time (say a week)?
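
One possible direction, given only as a sketch and not timed on the full 120 x 65,000 data: the counts can be obtained from matrix products instead of a loop over resamples. For each column, the number of 2's is the number of resampled pairs in which both rows carry a 1, and the total number of 1's drawn equals (number of 1's) + 2 * (number of 2's), so the three counts follow from two column-wise summaries. The helper name countGenotypes and its internal variable names are hypothetical.

```r
# Hypothetical helper: per-column counts of 0's, 1's and 2's over all
# resampled pairs, computed without looping over resamples.
countGenotypes <- function(H, pair) {
  nHap <- nrow(H)
  nRes <- ncol(pair)
  # N[a, b] = number of resamples that drew rows a and b (in that order)
  N <- matrix(tabulate(pair[1, ] + (pair[2, ] - 1) * nHap, nbins = nHap^2),
              nrow = nHap)
  # w[a] = number of times row a was drawn, in either position
  w <- tabulate(as.vector(pair), nbins = nHap)
  # Per column: total number of 1's over all drawn rows (= n1 + 2 * n2)
  ones <- drop(w %*% H)
  # Per column: number of resamples in which both rows carry a 1
  n2 <- colSums(H * (N %*% H))
  n1 <- ones - 2 * n2
  n0 <- nRes - n1 - n2
  rbind(n0, n1, n2)
}

# Usage on simulated data of the same shape as in the code above
set.seed(1)
nSNPs <- 1000
H <- matrix(sample(0:1, 120 * nSNPs, replace = TRUE), nrow = 120)
pair <- replicate(3000, sample(1:120, 2))
G_fast <- countGenotypes(H, pair)
```

The largest temporary object here is N %*% H, a 120 x nSNPs matrix of doubles (about 62 MB at 65,000 columns), so the approach should fit in 1 GB of RAM, although that has not been checked on the real data.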
My machine has an Intel Xeon 3.40 GHz CPU with 1 GB of RAM.
> sessionInfo()
R version 2.5.0 (2007-04-23)
i386-pc-mingw32
I would greatly appreciate any help.
Juan Pablo Lewinger
Department of Preventive Medicine
Keck School of Medicine
University of Southern California