[R] ideas to speed up code: converting a matrix of integers to a matrix of normally distributed values

Matthew Keller mckellercran at gmail.com
Fri Mar 16 19:30:18 CET 2007


Hi,

Many thanks to Jim and Martin for their suggestions. Using your ideas,
I came up with a solution that indexes rather than uses sapply (and
therefore calling up mvrnorm separately for each cell in the matrix).
The trick is to create a "key" matrix once, and then to use the
match() function each time I need to take the results from the key
matrix and place them in the appropriate spots in an 'effects' matrix.
If anyone is interested, here is a solution which speeds up the
process by a factor of 200 (!) :

unique.allele.seq <- 1:10000

make.effects <- function(allele.seq, seed, mn = 0, var=1, cov=.5) {
   set.seed(seed)
   return(mvrnorm(length(allele.seq),
mu=c(mn,mn),Sigma=matrix(c(var,cov,cov,var),nrow=2),
empirical=FALSE))}

effects.key <- make.effects(unique.allele.seq, 123)

(alleles <- matrix(c(15,3309,2080,1080,2080,1111,389,9362,6372,3787,2798,1009),ncol=4))
(indx <- match(alleles,key))

(a1 <- matrix(effects.key[indx,1],ncol=ncol(alleles)))
(a2 <- matrix(effects.key[indx,2],ncol=ncol(alleles)))

#to check timing
system.time({
alleles <- matrix(rep(c(5400,3309,2080,1080,2080,1111,389,9362,6372,3787,2798,1009),10000),ncol=4)
indx <- match(alleles,key)

a1 <- matrix(effects.key[indx,1],ncol=ncol(alleles))
a2 <- matrix(effects.key[indx,2],ncol=ncol(alleles))})




On 3/16/07, jim holtman <jholtman at gmail.com> wrote:
> Considering that the vast majority of your time is spent in the function
> mvrnorm (on my system 5.7  out of 6.1 seconds).  In your example that is
> 12000 calls to the function.  To improve your speed you have to cut down the
> number of calls to the function.  For example, how many unique integers do
> you have and can to do the calls for those and then substitute matching
> values.  Here is what Rprof showed:
>
>                   total.time total.pct self.time self.pct
> system.time             6.12      99.7      0.00      0.0
> as.vector               6.06      98.7      0.18      2.9
> FUN                     6.06      98.7      0.12      2.0
> array                   6.06      98.7      0.10      1.6
> lapply                  6.06      98.7      0.00      0.0
> sapply                  6.06      98.7      0.00      0.0
> eval                    6.04      98.4      0.06      1.0
> mvrnorm                 5.72      93.2      0.34      5.5
> eigen                   2.58      42.0      0.52      8.5
>
> or another way of looking at it:
>
>   0   6.1 root
>   1.    6.1 system.time
>   2. .    6.0 eval
>   3. . .    6.0 eval
>   4. . . .    6.0 array
>   5. . . . .    6.0 as.vector
>   6. . . . . .    6.0 sapply
>   7. . . . . . .    6.0 lapply
>   8. . . . . . . .    6.0 FUN
>   9. . . . . . . . .    5.7 mvrnorm
>  10. . . . . . . . . .    2.6 eigen
>  11. . . . . . . . . . .    1.2 sort.list
>  12. . . . . . . . . . . .    1.0 match.arg
>  13. . . . . . . . . . . . .    0.7 eval
>
>
>
>
> On 3/16/07, Matthew Keller <mckellercran at gmail.com> wrote:
> >
> > Hi all,
> >
> > [this is a bit hard to describe, so if my initial description is
> > confusing, please try running my code below]
> >
> > #WHAT I'M TRYING TO DO
> > I'd appreciate any help in trying to speed up some code. I've written
> > a script that converts a matrix of integers (usually between 1-10,000
> > - these represent allele names) into two new matrices of normally
> > distributed values (representing genetic effects), such that a given
> > number in the integer (allele name) matrix always corresponds to the
> > *same* randomly distributed (genetic) effects.
> >
> > For example, every time my function sees the allele name "3782", it
> > converts that integer into the same two effects (e.g., -.372  1.727),
> > which follow normal distributions (these effects can be correlated;
> > below I've set their correlation to .5). I have an entire matrix of
> > integers, and am converting those into two entire matrices of effects.
> >
> >
> > #HOW I'VE DONE IT SO FAR
> > To get the correlations between the effects, I've used the mvrnorm
> > function from "MASS"
> >
> > To convert the allele names to genetic effects, I've created a
> > function (make.effect) that resets the set.seed() to the allele name
> > each time its called.
> >
> > To get the matrix of genetic effects, I use sapply.
> >
> >
> > #THE PROBLEM
> > The problem is that I often need to convert matrices that have 500K
> > cells, and do this over several hundred iterations, so it is quite
> > slow. If anyone has ideas to speed this up (e.g., some clever way to
> > convert a matrix of integers to a matrix of effects without using
> > sapply), I would be forever indebted.
> >
> >
> > ##MY CODE
> >
> > library("MASS")
> >
> > ##run this example to see what I'm talking about above
> >
> > make.effects <- function(x,mn=0,var=1,cov=.5){
> > set.seed(x)
> >
> return(mvrnorm(1,mu=c(mn,mn),Sigma=matrix(c(var,cov,cov,var),nrow=2),empirical=FALSE))}
> >
> > (alleles <-
> matrix(c(5400,3309,2080,1080,2080,1111,389,9362,6372,3787,2798,1009),ncol=4))
> >
> > a12 <-
> array(sapply(alleles,make.effects),dim=c(2,nrow(alleles),ncol(alleles)))
> > (a1 <- a12[1,,])
> > (a2 <- a12[2,,])
> >
> > #I've set the population correlation = .5; empirical corr=.4635
> > cor(as.vector (a1),as.vector(a2))
> >
> > ##run this to see that the code begins to get pretty slow with even a
> > 3000x4 matrix
> >
> > system.time({
> >
> > alleles <-
> matrix(rep(c(5400,3309,2080,1080,2080,1111,389,9362,6372,3787,2798,1009),1000),ncol=4)
> >
> > a12 <-
> array(sapply(alleles,make.effects),dim=c(2,nrow(alleles),ncol(alleles)))
> > a1 <- a12[1,,]
> > a2 <- a12[2,,]
> >
> > })
> >
> >
> >
> >
> > --
> > Matthew C Keller
> > Postdoctoral Fellow
> > Virginia Institute for Psychiatric and Behavioral Genetics
> >
> > ______________________________________________
> > R-help at stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>
>
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem you are trying to solve?


-- 
Matthew C Keller
Postdoctoral Fellow
Virginia Institute for Psychiatric and Behavioral Genetics



More information about the R-help mailing list