[BioC] programming problem: running many ANOVAs
Arne.Muller at aventis.com
Arne.Muller at aventis.com
Mon Apr 19 18:51:36 CEST 2004
Hello,
I've a R/bioconductor programming question. For each gene in a matrix (emat) generated by exprs() I'm running an anova based on a goven model. The p-values for the anova and the coefficients from the lm for each gene are stored in a list and are returned.
A typical emat contains 12,000 rows and 40 colums or so. The probelm is that it runs about 10 minutes (P4 2.8GHz under linux with 2GB RAM) to complete.
I wonder if there is a more effifient way to program this routine (running the 12,000 lm's and ANOVAs). Is there a way to get rid of the generation of the data.frame that I pass to the lm?
I call my routine like this (dose and batch are two factors):
> dif.04h <- arnova(emat.04h, list(dose, batch), Value ~ dose + batch)
[1] "1000" "genes done ..."
[1] "2000" "genes done ..."
[1] "3000" "genes done ..."
[1] "4000" "genes done ..."
[1] "5000" "genes done ..."
[1] "6000" "genes done ..."
[1] "7000" "genes done ..."
[1] "8000" "genes done ..."
[1] "9000" "genes done ..."
[1] "10000" "genes done ..."
[1] "11000" "genes done ..."
[1] "12000" "genes done ..."
here's the routine:
arnova <- function(emat, factors, model, contr=NULL)
{
genes <- rownames(emat) # gene names
l <- length(genes) # number of genes
difflabels <- attr(terms(model),"term.labels") # factors from model
diff <- list() # stores an anova(lm) result
coef <- list() # stores an summary(lm)
### loop over all rows (genes) and calculate the lm & anova,
### the p-values and coefficients for each factor are stored
for ( i in 1:l ) {
### the input data frame for this gene
d <- data.frame(Value = emat[i, ], factors)
fit <- lm(model, data = d, contrasts=contr)
genecoef <- summary(fit)$coefficients[-1,4] # coefficients
genediff <- as.vector(na.omit(anova(fit)$'Pr(>F)')) # anova p-values
names(genediff) <- difflabels # the factor names
diff[[genes[i]]] <- genediff
coef[[genes[i]]] <- genecoef
###print progress
if ( !(i%%1000) ) { print(c(i, 'genes done ...')) }
}
list(anova=diff, coefs=coef)
}
A result look slike this:
> dif.04h$'anova'[1:3]
$"100001_at"
dose batch
0.5245583 0.2333894
$"100002_at"
dose batch
9.635027e-01 3.193364e-05
$"100003_at"
dose batch
7.374520e-01 4.639076e-06
> dif.04h$'coef'[1:3]
$"100001_at"
dose010mM dose025mM dose050mM dose100mM batchOLD batchPRG
0.6895026 0.2869963 0.7480160 0.4776234 0.1204537 0.1837618
$"100002_at"
dose010mM dose025mM dose050mM dose100mM batchOLD batchPRG
6.192953e-01 8.952266e-01 8.113848e-01 9.412216e-01 1.515069e-05 4.904563e-04
$"100003_at"
dose010mM dose025mM dose050mM dose100mM batchOLD batchPRG
9.574872e-01 8.735150e-01 4.739319e-01 2.870358e-01 9.635836e-06 1.549058e-05
Any suggestions to speed this up?
thanks a lot for your help,
+regards,
Arne
--
Arne Muller, Ph.D.
Toxicogenomics, Aventis Pharma
arne dot muller domain=aventis com
More information about the Bioconductor
mailing list