[BioC] Help on factorial experiment analysis using limma

Mon Sep 15 18:12:52 MEST 2003

Dear Fai,

> This reminds me a very general question with regard to replication I have
> had for a while. What is the proper way to analyze the replicated data if
> there are both biological replication and technical replication in the raw
> data? Consider an example in which there are a hundred samples from
> different cancer patients and the microarray experiment for each sample is
> repeated three times. I heard some people would treat both biological and
> technical replicates equally in this case. But isn't it true that the
> technical replicates would have smaller variance and are somehow related
> with each other and should be treated differently?

I think so. Gordon and I had an email exchange about these issues on late 
august (19 to 21). Just in case this could be of any help, this is what I 
ended up doing.

I had a design with treatment and control hybridized on the same array. There 
were 3 biological replicates and 2 technical replicates for each biological 
replicate (with dye swap on the technical repl). Then, this is very much like 
a one-sample t-test (ignoring the dye-swap and technical replicates issues).

My first try was to use linear mixed effects models (with library nlme); fit a 
model to just the intercept, and then re-organize the output of lme, and use, 
for example, ebayes. However, probably because of the tiny sample sizes, I 
was occasionally getting funny results from lme that didn't seem to make 
sense. In addition there could be very differences between analyzing all data 
ignoring technical replication (3 df) or analyzing the means of the technical 
replicates (2 df) (and my collaborators had made clear that they needed, for 
other various reasons, "hard rules" based on p-values). Thus, I finally 
decided to obtain the means of the technical replicates, and analyze the data 
as in the swirl example (like a one-sample t-test); this is a conservative 
approach and my collaborators and I agreed that it would be a lot easier to 
explain and to defend, specially with such small sample sizes.

Of course, if everything works as it should, the estimates of the t-statistic 
from using the t-test on the means of replicates and the lme t-statistic 
should be very, very similar, but because of the differences in the df we can 
end up with differences in the p-values.

Below is a function I concocted quickly to see how things differed between 
limma, lme and t-test on the means of the technical replicates. I used it to 
try to understand what might happen.

Best,

Ramón
*********************************************

library(nlme)
library(limma)
options(contrasts = c("contr.helmert", "contr.poly"))

f3 <- function(effect.size = 0,
               s.biol = 2, s.tech = 1,
               biological.repls = 5, n.tech.reps = 10){

    ## For one-sample comparisons, with dye-swaps
    ## generates data, and fits models using both limma
    ## and lme and one-sample t-tests on the mean of the
    ## technical replicates

    if(getOption("contrasts")[1] != "contr.helmert")
        stop("You need to set Helmert contrasts in options")
    if(n.tech.reps %% 2) stop("Need even number of n.tech.reps")
    ## simulate data
    tech.rep.effect <- rnorm(biological.repls, mean = 0, sd = s.biol)
    tech.rep.effect <- rep(tech.rep.effect,
                           rep(n.tech.reps, biological.repls))
    data.l <- data <-
        rnorm(biological.repls * n.tech.reps, mean = 0, sd = s.tech) +
            tech.rep.effect + effect.size
    dim(data.l) <- c(1, biological.repls * n.tech.reps)

    ## design matrices, etc.
    tech.reps <-
        factor(rep(1:biological.repls, rep(n.tech.reps, biological.repls)))
    dm1 <- model.matrix(data ~ tech.reps)
    dye.swap.codes <- rep(c(-1, 1), (n.tech.reps * biological.repls)/2)

    data.sp <- data.l * dye.swap.codes
    data.s <- data * dye.swap.codes

    ## model fitting
    tmp1 <- lm.series(data.l, dm1)
    tmp2 <- lm.series(data.sp, dm1 * dye.swap.codes)
    tmp3 <- summary(lme(data ~ 1, random = ~ 1|tech.reps,
                    control = list(tolerance = 1e-10, msTol = 1e-10)))
    tmp4 <- summary(lme(data.s ~ -1 + dye.swap.codes,
                        random = ~ 1|tech.reps,
                        control = list(tolerance = 1e-10, msTol = 1e-10)))

    tmp1 <- c(tmp1$coefficients[[1]], tmp1$stdev.unscaled[[1]],
              tmp1$sigma, tmp1$sigma *tmp1$stdev.unscaled[[1]], tmp1$df)
    names(tmp1) <- c("coeff", "unscaled.stdev", "sigma", "scaled se", "df")
    tmp2 <- c(tmp2$coefficients[[1]], tmp2$stdev.unscaled[[1]],
              tmp2$sigma, tmp2$sigma *tmp2$stdev.unscaled[[1]], tmp2$df)

    cat("\n\n *************\n\n")
    mean.by.tech.rep <- tapply(data, tech.reps, mean)
    the.t.results <- unlist(t.test(mean.by.tech.rep)[1:6])
    the.t.se <- sqrt(var(mean.by.tech.rep)/length(unique(tech.reps)))
    lme.and.t <- rbind(lme.positivized.t.table = tmp3$tTable,
                       lme.swap.design.t.table = tmp4$tTable,
                       t.test.results = c(the.t.results[6], the.t.se,
                       the.t.results[c(2, 1, 3)]))
    rownames(lme.and.t) <- c("lme.positivized", "lme.swap", "t.test")

    return(
           lme.and.limma = rbind(
           lm.series.positivized = tmp1,
           lm.series.swap.design = tmp2, 
           lme.positivized = c(tmp3$coefficients[[1]],
           sqrt(tmp3$varFix)/tmp3$sigma, tmp3$sigma, sqrt(tmp3$varFix), 
tmp3$tTable[[3]]),
           lme.swap.design = c(tmp4$coefficients[[1]],
           sqrt(tmp4$varFix)/tmp4$sigma, tmp4$sigma, sqrt(tmp4$varFix), 
tmp4$tTable[[3]])),
           lme.and.t)
}

## an example call:
f3(2, 2, 1, 3, 2)

>
> > More good questions. The design matrix that you've written above
> > corresponds to a classical interaction parametrization. Here the column
> > 'ab' corresponds to extra effect that 'a' has in the presence of 'b'. The
> > effect of 'a' by itself (a0-00) is represented by the coefficient 'a' and
> > the effect of 'a' in the presence of 'b' (ab-0b) is represented by the
> > sum of the coefficients for 'a' and 'ba'. If 'b' is a confounding factor,
> > then you probably want to have the effects for 'a' with and without 'b'
> > in your heatdiagram. You could do this by
> >
> > fit <- lm.series(MA, design)
> > contrast.matrix <- makeContrasts(a,a+ba,levels=design)
> > fit2 <- contrasts.fit(fit, contrast.matrix)
> > eb2 <- ebayes(fit2)
> > heatdiagram(stat=eb2$t,coef=fit2$coef)
> >
> > This would show whether genes which respond to factor 'a' still respond
> > to 'a' in the presence of 'b', and whether in the same direction. Note
> > that makeContrasts() is available only in the development version of
> > limma.
>
> Could you give me further assistance on understanding the biological
> meaning of the heatdiagram? I know red means the gene is upregulated and
> green means the gene is downregulated. Does white color mean the
> statistical test is not significant for that gene? In my particular
> example, what does it mean when the gene i is
>
> i. red for a and green/less red for a+ba? Does it mean b is suppressing the
> effect of a on this gene?
> ii. red for a and white for a+ba? Does it mean b has no effect on this gene
> while a is upregulating it?
> iii. red for a and more red for a+ba? Does it mean a and b are both
> upregulating gene i?
> (I won't consider the green situations for a because they are essentially
> the same)
>
> This question sounds a little bit stupid, but I believe a step-by-step
> guide is actually very helpful for biologist like me to understand the
> result. It would be nice to have more tutorials on how to use heatdiagram,
> make contrast, venndiagram etc to interpret the biological meaning of
> analysis result in the manual of your future version of limma.
>
> Thanks very much!
>
> Best regards,
> Fai
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor

-- 
Ramón Díaz-Uriarte
Bioinformatics Unit
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)
Fax: +-34-91-224-6972
Phone: +-34-91-224-6900

http://bioinfo.cnio.es/~rdiaz