Dear Gordon and List,
I would very much appreciate your comment on the experiment design in LIMMA. It is about processing of experiments with multiple treatments.
Let's say we have a simple Affy experiment with 16 samples collected from a cell line (treated/untreated) in two time points:
- 4 treated, 4 untreated - time point 1
- 4 treated, 4 untreated - time point 2
We are interested in differential expression between treated and untreated cells, in point1 and point2 separately.
When we process all samples together (normalise them together and fit linear fit models using the whole dataset) in LIMMA we will get results different from when we process data for points 1 and 2 separately (normalise them together but fit liner models separately).
I do understand that it should be like this (more information for priors), but I do not know whether there is some kind of a criterion helping decide whether to process them separately or in one go. It seems that adding more treatments into the mix increases statistical power and thus, increases the number of genes classified as differentially expressed.
The latter seems a bit strange to me, because the number of genes classified as differentially expressed in one comparison (contrast) should not depend on the genes differentially expressed in some other comparison (contrast).
I have tried this on real data, processing data for different time points separately and together. I found that t and p values and also ordering of genes were different.
To illustrte this further I created a simple example in R to demonstrate the effect of adding more treatments:
I have generated a random matrix with 12 columns (4 treatments, 3 replicates each), first 3 genes are upregulated in the first treatment.
Then I ran linear fit + eBayes (i) on the first 6 columns, (ii) on the first 9 columns and (iii) on all 12 columns, while only testing the contrast between the first two treatments. My code and results are below.
topTable results, below, show that by adding treatments we change estimates not only for moderated t statistics but also for ordinary ones.
Could you, please, help me clarify this? I'd like to understand why information about
treatments that may have no influence on the contrast of interest (in my example it may be the same cell line: untreated plus treated with 3 different chemical compounds, and we test ONLY the difference between untreated cells and cells treated with compound 1, leaving treatments with the other two compaunds out) affects lmFit and eBayes results for this contrast so much.
Thank you very much for your help.
With kind regards,
Lev Soinov.
Code in R:
set.seed(2004); invisible(runif(100))
M1 <- matrix(rnorm(100*12,sd=0.3),100,12)
M<-M1[,1:6]
M[1:3,1:3] <- M[1:3,1:3] + 2
design<-model.matrix(~0 +factor(c(1,1,1,2,2,2)))
colnames(design) <- c("group1", "group2")
contrast.matrix <- makeContrasts(group2-group1, levels=design)
fit <- lmFit(M, design)
fit2 <- contrasts.fit(fit, contrast.matrix)
ordinary.t1 <- fit2$coef / fit2$stdev.unscaled / fit2$sigma
fit2 <- eBayes(fit2)
topTable(fit2, coef=1, adjust="BH")
M<-M1[,1:9]
M[1:3,1:3] <- M[1:3,1:3] + 2
design<-model.matrix(~0 +factor(c(1,1,1,2,2,2,3,3,3)))
colnames(design) <- c("group1", "group2", "group3")
contrast.matrix <- makeContrasts(group2-group1, levels=design)
fit <- lmFit(M, design)
fit2 <- contrasts.fit(fit, contrast.matrix)
ordinary.t2 <- fit2$coef / fit2$stdev.unscaled / fit2$sigma
fit2 <- eBayes(fit2)
topTable(fit2, coef=1, adjust="BH")
M<-M1[,1:12]
M[1:3,1:3] <- M[1:3,1:3] + 2
design<-model.matrix(~0 +factor(c(1,1,1,2,2,2,3,3,3,4,4,4)))
colnames(design) <- c("group1", "group2", "group3", "group4")
contrast.matrix <- makeContrasts(group2-group1, levels=design)
fit <- lmFit(M, design)
fit2 <- contrasts.fit(fit, contrast.matrix)
ordinary.t3 <- fit2$coef / fit2$stdev.unscaled / fit2$sigma
fit2 <- eBayes(fit2)
topTable(fit2, coef=1, adjust="BH")
Ordinary t statistics appeared to be very different for different runs, and only the first ordinary.t1 actually corresponds to the ordinary t statictics that can be calculated in Excel:
> ordinary.t1[1:3,]
[1] -6.260828 -16.707178 -16.503064
> ordinary.t2[1:3,]
[1] -7.023479 -19.172993 -14.772837
> ordinary.t3[1:3,]
[1] -7.114086 -10.479982 -15.143921
eBayes statistics are even more different:
1. first 6 columns:
logFC t P.Value adj.P.Val B
3 -2.2243983 -9.922826 6.716099e-09 5.502173e-07 10.597497
2 -2.1486277 -9.618087 1.100435e-08 5.502173e-07 10.096915
1 -1.9143686 -7.434466 5.280488e-07 1.760163e-05 6.162366
2. first 9 columns:
logFC t P.Value adj.P.Val B
3 -2.2243983 -10.415452 1.066367e-09 5.478927e-08 12.339299
2 -2.1486277 -10.399332 1.095785e-09 5.478927e-08 12.311757
1 -1.9143686 -7.781305 1.382557e-07 4.608522e-06 7.403517
3. first 12 columns:
logFC t P.Value adj.P.Val B
3 -2.2243983 -9.423325 4.011632e-14 4.011632e-12 21.717523
2 -2.1486277 -8.919133 3.398312e-13 1.699156e-11 19.599063
1 -1.9143686 -7.721552 5.585622e-11 1.861874e-09 14.547154
---------------------------------
[[alternative HTML version deleted]]