[R] ideas to speed up code: converting a matrix of integers to a matrix of normally distributed values
Matthew Keller
mckellercran at gmail.com
Fri Mar 16 19:30:18 CET 2007
Hi,
Many thanks to Jim and Martin for their suggestions. Using your ideas,
I came up with a solution that indexes rather than uses sapply (and
therefore calling up mvrnorm separately for each cell in the matrix).
The trick is to create a "key" matrix once, and then to use the
match() function each time I need to take the results from the key
matrix and place them in the appropriate spots in an 'effects' matrix.
If anyone is interested, here is a solution which speeds up the
process by a factor of 200 (!) :
unique.allele.seq <- 1:10000
make.effects <- function(allele.seq, seed, mn = 0, var=1, cov=.5) {
set.seed(seed)
return(mvrnorm(length(allele.seq),
mu=c(mn,mn),Sigma=matrix(c(var,cov,cov,var),nrow=2),
empirical=FALSE))}
effects.key <- make.effects(unique.allele.seq, 123)
(alleles <- matrix(c(15,3309,2080,1080,2080,1111,389,9362,6372,3787,2798,1009),ncol=4))
(indx <- match(alleles,key))
(a1 <- matrix(effects.key[indx,1],ncol=ncol(alleles)))
(a2 <- matrix(effects.key[indx,2],ncol=ncol(alleles)))
#to check timing
system.time({
alleles <- matrix(rep(c(5400,3309,2080,1080,2080,1111,389,9362,6372,3787,2798,1009),10000),ncol=4)
indx <- match(alleles,key)
a1 <- matrix(effects.key[indx,1],ncol=ncol(alleles))
a2 <- matrix(effects.key[indx,2],ncol=ncol(alleles))})
On 3/16/07, jim holtman <jholtman at gmail.com> wrote:
> Considering that the vast majority of your time is spent in the function
> mvrnorm (on my system 5.7 out of 6.1 seconds). In your example that is
> 12000 calls to the function. To improve your speed you have to cut down the
> number of calls to the function. For example, how many unique integers do
> you have and can to do the calls for those and then substitute matching
> values. Here is what Rprof showed:
>
> total.time total.pct self.time self.pct
> system.time 6.12 99.7 0.00 0.0
> as.vector 6.06 98.7 0.18 2.9
> FUN 6.06 98.7 0.12 2.0
> array 6.06 98.7 0.10 1.6
> lapply 6.06 98.7 0.00 0.0
> sapply 6.06 98.7 0.00 0.0
> eval 6.04 98.4 0.06 1.0
> mvrnorm 5.72 93.2 0.34 5.5
> eigen 2.58 42.0 0.52 8.5
>
> or another way of looking at it:
>
> 0 6.1 root
> 1. 6.1 system.time
> 2. . 6.0 eval
> 3. . . 6.0 eval
> 4. . . . 6.0 array
> 5. . . . . 6.0 as.vector
> 6. . . . . . 6.0 sapply
> 7. . . . . . . 6.0 lapply
> 8. . . . . . . . 6.0 FUN
> 9. . . . . . . . . 5.7 mvrnorm
> 10. . . . . . . . . . 2.6 eigen
> 11. . . . . . . . . . . 1.2 sort.list
> 12. . . . . . . . . . . . 1.0 match.arg
> 13. . . . . . . . . . . . . 0.7 eval
>
>
>
>
> On 3/16/07, Matthew Keller <mckellercran at gmail.com> wrote:
> >
> > Hi all,
> >
> > [this is a bit hard to describe, so if my initial description is
> > confusing, please try running my code below]
> >
> > #WHAT I'M TRYING TO DO
> > I'd appreciate any help in trying to speed up some code. I've written
> > a script that converts a matrix of integers (usually between 1-10,000
> > - these represent allele names) into two new matrices of normally
> > distributed values (representing genetic effects), such that a given
> > number in the integer (allele name) matrix always corresponds to the
> > *same* randomly distributed (genetic) effects.
> >
> > For example, every time my function sees the allele name "3782", it
> > converts that integer into the same two effects (e.g., -.372 1.727),
> > which follow normal distributions (these effects can be correlated;
> > below I've set their correlation to .5). I have an entire matrix of
> > integers, and am converting those into two entire matrices of effects.
> >
> >
> > #HOW I'VE DONE IT SO FAR
> > To get the correlations between the effects, I've used the mvrnorm
> > function from "MASS"
> >
> > To convert the allele names to genetic effects, I've created a
> > function (make.effect) that resets the set.seed() to the allele name
> > each time its called.
> >
> > To get the matrix of genetic effects, I use sapply.
> >
> >
> > #THE PROBLEM
> > The problem is that I often need to convert matrices that have 500K
> > cells, and do this over several hundred iterations, so it is quite
> > slow. If anyone has ideas to speed this up (e.g., some clever way to
> > convert a matrix of integers to a matrix of effects without using
> > sapply), I would be forever indebted.
> >
> >
> > ##MY CODE
> >
> > library("MASS")
> >
> > ##run this example to see what I'm talking about above
> >
> > make.effects <- function(x,mn=0,var=1,cov=.5){
> > set.seed(x)
> >
> return(mvrnorm(1,mu=c(mn,mn),Sigma=matrix(c(var,cov,cov,var),nrow=2),empirical=FALSE))}
> >
> > (alleles <-
> matrix(c(5400,3309,2080,1080,2080,1111,389,9362,6372,3787,2798,1009),ncol=4))
> >
> > a12 <-
> array(sapply(alleles,make.effects),dim=c(2,nrow(alleles),ncol(alleles)))
> > (a1 <- a12[1,,])
> > (a2 <- a12[2,,])
> >
> > #I've set the population correlation = .5; empirical corr=.4635
> > cor(as.vector (a1),as.vector(a2))
> >
> > ##run this to see that the code begins to get pretty slow with even a
> > 3000x4 matrix
> >
> > system.time({
> >
> > alleles <-
> matrix(rep(c(5400,3309,2080,1080,2080,1111,389,9362,6372,3787,2798,1009),1000),ncol=4)
> >
> > a12 <-
> array(sapply(alleles,make.effects),dim=c(2,nrow(alleles),ncol(alleles)))
> > a1 <- a12[1,,]
> > a2 <- a12[2,,]
> >
> > })
> >
> >
> >
> >
> > --
> > Matthew C Keller
> > Postdoctoral Fellow
> > Virginia Institute for Psychiatric and Behavioral Genetics
> >
> > ______________________________________________
> > R-help at stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>
>
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem you are trying to solve?
--
Matthew C Keller
Postdoctoral Fellow
Virginia Institute for Psychiatric and Behavioral Genetics
More information about the R-help
mailing list