[BioC] programming problem: running many ANOVAs

Francois Collin fcollin at sbcglobal.net
Mon Apr 19 19:49:35 CEST 2004


Yes, there is a way to avoid all of the overhead in
lm().  As you are fitting the same model to the data
for all genes, you should only have to get the qr
decomposition of your design matrix once.  Here are
some ideas for code - with someting like this you
should get the same results that you would get out of
lm() in seconds.:

###################################
 ## get qr decomposition of design matrix
  x.mtx <- model.matrix(~ -1 + Visit:Treatment)
  x.qr<-qr(x.mtx)

  x.det<-prod(diag(x.qr$qr))
  if (x.det == 0) stop("Determinent is Zero!")

  # get unscaled covariance and degress of freedom
  x.cov.unscaled<-solve(t(qr.R(x.qr)) %*% qr.R(x.qr)) 
  x.df<-nrow(x.mtx)-x.qr$rank

  # Fit Model for all genes 
  # Y.mtx has one column for every gene
  $ rows of Y.mtx are observations with Visit,
Treatment attributes

  Fit.coef<-qr.coef(x.qr, Y.mtx)
  Fit.res<-qr.resid(x.qr, Y.mtx)

  Fit.sigma2<-matrix(apply(Fit.res, 2, 
                           FUN=function(y)
{t(y)%*%y/x.df}),
                        nrow=1)


  Fit.coef.var<-apply(Fit.sigma2, 2,
                       FUN=function(y)   
{y*diag(x.cov.unscaled)})
 
  Fit.coef.t<-Fit.coef/sqrt(Fit.coef.var)

   # # # etc. 


--- Arne.Muller at aventis.com wrote:
> Hello,
> 
> I've a R/bioconductor programming question. For each
> gene in a matrix (emat) generated by exprs() I'm
> running an anova based on a goven model. The
> p-values for the anova and the coefficients from the
> lm for each gene are stored in a list and are
> returned.
> 
> A typical emat contains 12,000 rows and 40 colums or
> so. The probelm is that it runs about 10 minutes (P4
> 2.8GHz under linux with 2GB RAM) to complete.
> 
> I wonder if there is a more effifient way to program
> this routine (running the 12,000 lm's and ANOVAs).
> Is there a way to get rid of the generation of the
> data.frame that I pass to the lm?
> 
> I call my routine like this (dose and batch are two
> factors):
> 
> > dif.04h <- arnova(emat.04h, list(dose, batch),
> Value ~ dose + batch)
> [1] "1000"           "genes done ..."
> [1] "2000"           "genes done ..."
> [1] "3000"           "genes done ..."
> [1] "4000"           "genes done ..."
> [1] "5000"           "genes done ..."
> [1] "6000"           "genes done ..."
> [1] "7000"           "genes done ..."
> [1] "8000"           "genes done ..."
> [1] "9000"           "genes done ..."
> [1] "10000"          "genes done ..."
> [1] "11000"          "genes done ..."
> [1] "12000"          "genes done ..."
> 
> here's the routine:
> 
> arnova <- function(emat, factors, model, contr=NULL)
> {
>       genes <- rownames(emat) # gene names
>       l <- length(genes) # number of genes
>       difflabels <- attr(terms(model),"term.labels")
> # factors from model
> 	diff <- list() # stores an anova(lm) result
>       coef <- list() # stores an summary(lm)
> 
>       ### loop over all rows (genes) and calculate
> the lm & anova,
> 	### the p-values and coefficients for each factor
> are stored
> 	for ( i in 1:l ) {
> 
>           ### the input data frame for this gene
>           d <- data.frame(Value = emat[i, ],
> factors)
>      	    fit <- lm(model, data = d,
> contrasts=contr)
>           genecoef <-
> summary(fit)$coefficients[-1,4] # coefficients
>           genediff <-
> as.vector(na.omit(anova(fit)$'Pr(>F)')) # anova
> p-values
>           names(genediff) <- difflabels # the factor
> names
>           diff[[genes[i]]] <- genediff
>           coef[[genes[i]]] <- genecoef
>           
>           ###print progress
>           if ( !(i%%1000) ) { print(c(i, 'genes done
> ...')) }
> 	}
> 
> 	list(anova=diff, coefs=coef)
> }
> 
> A result look slike this:
> 
> > dif.04h$'anova'[1:3]
> $"100001_at"
>      dose     batch 
> 0.5245583 0.2333894 
> 
> $"100002_at"
>         dose        batch 
> 9.635027e-01 3.193364e-05 
> 
> $"100003_at"
>         dose        batch 
> 7.374520e-01 4.639076e-06 
> 
> > dif.04h$'coef'[1:3]
> $"100001_at"
> dose010mM dose025mM dose050mM dose100mM  batchOLD 
> batchPRG 
> 0.6895026 0.2869963 0.7480160 0.4776234 0.1204537
> 0.1837618 
> 
> $"100002_at"
>    dose010mM    dose025mM    dose050mM    dose100mM 
>    batchOLD     batchPRG 
> 6.192953e-01 8.952266e-01 8.113848e-01 9.412216e-01
> 1.515069e-05 4.904563e-04 
> 
> $"100003_at"
>    dose010mM    dose025mM    dose050mM    dose100mM 
>    batchOLD     batchPRG 
> 9.574872e-01 8.735150e-01 4.739319e-01 2.870358e-01
> 9.635836e-06 1.549058e-05 
> 
> Any suggestions to speed this up?
> 
> 	thanks a lot for your help,
> 	+regards,
> 
> 	Arne
> 
> --
> Arne Muller, Ph.D.
> Toxicogenomics, Aventis Pharma
> arne dot muller domain=aventis com
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
>
https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor



More information about the Bioconductor mailing list