Hi Naomi,
Yes, you are right, my question was not formulated accurately enough. Of course, if you add information you would expect better estimates for the model. I just want (i) to find out how it works within LIMMA and also (ii) to understand what information, i.e. which treatments, one can add to increase rather than decrease the accuracy of the inferences. For example, if we had an untreated cell line and the same cells under some shock treatment, should we or could we mix them together?
I would really appreciate your comments on both questions above.
Thank you for your help!
With kind regards,
Lev.
Naomi Altman wrote:
"The latter seems a bit strange to me, because the number of genes
classified as differentially expressed in one comparison (contrast)
should not depend on the genes differentially expressed in some other
comparison (contrast)."
Welcome to statistics! This is a feature of all statistical
analyses. For example, if you do only one test at alpha=0.05, you
reject with p=0.049. But if you do 10,000 tests you will use a
multiple comparisons procedure, and your original test will no longer
reject. Bayesian analysis also depends on the entire set of
tests, because the posterior depends on all the data.
--Naomi
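The multiple-comparisons point above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the 9,999 "other" p-values are simulated as uniform (all null), and the seed and variable names are arbitrary choices, not anything from the original analysis.

```r
# Hypothetical illustration: the same p-value tested alone vs. among 10,000 tests.
set.seed(1)                      # arbitrary seed, for reproducibility only
alpha    <- 0.05
p_single <- 0.049

# Tested on its own, the hypothesis is rejected at alpha = 0.05.
reject_alone <- p_single < alpha

# Embed the same p-value among 9,999 null p-values (assumed uniform on [0, 1]).
p_all <- c(p_single, runif(9999))

# Adjust for multiplicity; element 1 is still the original test.
p_bonf <- p.adjust(p_all, method = "bonferroni")
p_bh   <- p.adjust(p_all, method = "BH")

reject_bonf <- p_bonf[1] < alpha
reject_bh   <- p_bh[1] < alpha

print(c(alone = reject_alone, bonferroni = reject_bonf, BH = reject_bh))
```

The adjusted value for the first test depends on how many other tests there are and, for BH, on their p-values, which is why the verdict on one contrast can change when other tests join the analysis.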
At 04:48 AM 6/18/2007, Lev Soinov wrote:
> Dear Gordon and List,
>
> I would very much appreciate your comment on the experiment
> design in LIMMA. It is about processing of experiments with
> multiple treatments.
>
> Let's say we have a simple Affy experiment with 16 samples
> collected from a cell line (treated/untreated) at two time points:
> - 4 treated, 4 untreated - time point 1
> - 4 treated, 4 untreated - time point 2
> We are interested in differential expression between treated and
> untreated cells, at time points 1 and 2 separately.
> When we process all samples together (normalise them together and
> fit linear models using the whole dataset) in LIMMA we will get
> results different from when we process data for points 1 and 2
> separately (normalise them together but fit linear models separately).
>
> I do understand that it should be like this (more information for
> priors), but I do not know whether there is some kind of
> criterion to help decide whether to process them separately or in
> one go. It seems that adding more treatments into the mix increases
> statistical power and thus increases the number of genes
> classified as differentially expressed.
> The latter seems a bit strange to me, because the number of genes
> classified as differentially expressed in one comparison (contrast)
> should not depend on the genes differentially expressed in some
> other comparison (contrast).
>
> I have tried this on real data, processing data for different
> time points separately and together. I found that t and p values
> and also ordering of genes were different.
>
> To illustrate this further I created a simple example in R to
> demonstrate the effect of adding more treatments:
> I have generated a random matrix with 12 columns (4 treatments, 3
> replicates each), first 3 genes are upregulated in the first treatment.
> Then I ran linear fit + eBayes (i) on the first 6 columns, (ii)
> on the first 9 columns and (iii) on all 12 columns, while only
> testing the contrast between the first two treatments. My code and
> results are below.
> topTable results, below, show that by adding treatments we change
> estimates not only for moderated t statistics but also for ordinary ones.
> Could you please help me clarify this? I'd like to understand
> why information about treatments that may have no influence on the
> contrast of interest (in my example it may be the same cell line:
> untreated plus treated with 3 different chemical compounds, and we
> test ONLY the difference between untreated cells and cells treated
> with compound 1, leaving treatments with the other two compounds
> out) affects lmFit and eBayes results for this contrast so much.
> Thank you very much for your help.
> With kind regards,
> Lev Soinov.
>
> Code in R:
> library(limma)
> set.seed(2004); invisible(runif(100))
> # 100 genes x 12 arrays of pure noise (4 treatments, 3 replicates each)
> M1 <- matrix(rnorm(100*12,sd=0.3),100,12)
> # (i) first 6 columns: groups 1 and 2 only
> M<-M1[,1:6]
> M[1:3,1:3] <- M[1:3,1:3] + 2   # genes 1-3 up in group 1
> design<-model.matrix(~0 +factor(c(1,1,1,2,2,2)))
> colnames(design) <- c("group1", "group2")
> contrast.matrix <- makeContrasts(group2-group1, levels=design)
> fit <- lmFit(M, design)
> fit2 <- contrasts.fit(fit, contrast.matrix)
> # ordinary (unmoderated) t statistic for the contrast
> ordinary.t1 <- fit2$coef / fit2$stdev.unscaled / fit2$sigma
> fit2 <- eBayes(fit2)
> topTable(fit2, coef=1, adjust="BH")
>
> # (ii) first 9 columns: groups 1 to 3
> M<-M1[,1:9]
> M[1:3,1:3] <- M[1:3,1:3] + 2
> design<-model.matrix(~0 +factor(c(1,1,1,2,2,2,3,3,3)))
> colnames(design) <- c("group1", "group2", "group3")
> contrast.matrix <- makeContrasts(group2-group1, levels=design)
> fit <- lmFit(M, design)
> fit2 <- contrasts.fit(fit, contrast.matrix)
> ordinary.t2 <- fit2$coef / fit2$stdev.unscaled / fit2$sigma
> fit2 <- eBayes(fit2)
> topTable(fit2, coef=1, adjust="BH")
>
> # (iii) all 12 columns: groups 1 to 4
> M<-M1[,1:12]
> M[1:3,1:3] <- M[1:3,1:3] + 2
> design<-model.matrix(~0 +factor(c(1,1,1,2,2,2,3,3,3,4,4,4)))
> colnames(design) <- c("group1", "group2", "group3", "group4")
> contrast.matrix <- makeContrasts(group2-group1, levels=design)
> fit <- lmFit(M, design)
> fit2 <- contrasts.fit(fit, contrast.matrix)
> ordinary.t3 <- fit2$coef / fit2$stdev.unscaled / fit2$sigma
> fit2 <- eBayes(fit2)
> topTable(fit2, coef=1, adjust="BH")
>
> The ordinary t statistics turned out to be very different between
> runs, and only the first, ordinary.t1, actually corresponds to the
> ordinary t statistics that can be calculated in Excel:
> > ordinary.t1[1:3,]
> [1] -6.260828 -16.707178 -16.503064
> > ordinary.t2[1:3,]
> [1] -7.023479 -19.172993 -14.772837
> > ordinary.t3[1:3,]
> [1] -7.114086 -10.479982 -15.143921
>
> eBayes statistics are even more different:
> 1. first 6 columns:
>        logFC          t      P.Value    adj.P.Val         B
> 3 -2.2243983  -9.922826 6.716099e-09 5.502173e-07 10.597497
> 2 -2.1486277  -9.618087 1.100435e-08 5.502173e-07 10.096915
> 1 -1.9143686  -7.434466 5.280488e-07 1.760163e-05  6.162366
>
> 2. first 9 columns:
>        logFC          t      P.Value    adj.P.Val         B
> 3 -2.2243983 -10.415452 1.066367e-09 5.478927e-08 12.339299
> 2 -2.1486277 -10.399332 1.095785e-09 5.478927e-08 12.311757
> 1 -1.9143686  -7.781305 1.382557e-07 4.608522e-06  7.403517
>
> 3. first 12 columns:
>        logFC          t      P.Value    adj.P.Val         B
> 3 -2.2243983  -9.423325 4.011632e-14 4.011632e-12 21.717523
> 2 -2.1486277  -8.919133 3.398312e-13 1.699156e-11 19.599063
> 1 -1.9143686  -7.721552 5.585622e-11 1.861874e-09 14.547154
>
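One reason the ordinary t statistics in the example move as groups are added: a linear model fitted to all arrays estimates a single residual standard deviation per gene from every group in the design. The numerator of the group2-group1 contrast stays the same, but its denominator and degrees of freedom change. The sketch below uses base R `lm` and `t.test` on one simulated gene rather than limma; the seed, the simulated values and the variable names are illustrative assumptions, not the data from the example above.

```r
# Hypothetical single-gene illustration (base R only, not limma).
set.seed(42)                                  # arbitrary seed
y <- rnorm(12, sd = 0.3)                      # one gene, 12 arrays
y[1:3] <- y[1:3] + 2                          # up-regulated in group 1
group <- factor(rep(1:4, each = 3))           # 4 groups, 3 replicates each

# (a) Classical two-sample t-test on groups 1 and 2 only (pooled variance, 4 df).
t_two <- t.test(y[4:6], y[1:3], var.equal = TRUE)

# (b) One-way model with all four groups: the residual SD is pooled
#     over every group, giving 12 - 4 = 8 residual df.
fit      <- lm(y ~ 0 + group)
s_pooled <- summary(fit)$sigma
diff21   <- unname(coef(fit)["group2"] - coef(fit)["group1"])
t_all    <- diff21 / (s_pooled * sqrt(1/3 + 1/3))

# Same numerator in both analyses, different denominators.
s_two <- sqrt(mean(tapply(y, group, var)[1:2]))   # pooled SD, groups 1-2 only
print(c(difference = diff21, sd_two_groups = s_two, sd_four_groups = s_pooled))
print(c(t_two_groups = unname(t_two$statistic), t_four_groups = t_all))
```

With equal group sizes the pooled variance is simply the average of the within-group variances, so groups 3 and 4 enter the denominator even though they do not enter the contrast. eBayes adds a second dependence: the prior variance and prior degrees of freedom are estimated from the residual variances of all genes, and those in turn depend on which arrays are in the fit.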
>
>_______________________________________________
>Bioconductor mailing list
>Bioconductor@stat.math.ethz.ch
>https://stat.ethz.ch/mailman/listinfo/bioconductor
>Search the archives:
>http://news.gmane.org/gmane.science.biology.informatics.conductor
Naomi S. Altman 814-865-3791 (voice)
Associate Professor
Dept. of Statistics 814-863-7114 (fax)
Penn State University 814-865-1348 (Statistics)
University Park, PA 16802-2111