[BioC] 2x3 factorial experiment
James W. MacDonald
jmacdon at uw.edu
Wed May 29 17:50:37 CEST 2013
Hi Ilario,
On 5/28/2013 12:49 PM, Ilario De Toma wrote:
> I have performed a microarray experiment comparing primary monocyte from
> mouse fetal livers.
>
> I have three KO samples vs. three WT samples (biological replicates). Each
> sample have been splitted into three plates:
> - 1/3 (P) have been collected while proliferating (bacteria plate with
> GM-CSF)
> -1/3 (U) have been let untreated
> -1/3 (LPS) have been treated with LPS
>
> Here is a summary of my experiment:
>
> Status treat liver
> WT1.P WT U 1
> KO1.P KO U A
> WT2.P WT U 2
> KO2.P KO U B
> WT3.P WT U 3
> KO3.P KO U C
> WT1.U WT U 1
> KO1.U KO U A
> WT2.U WT U 2
> KO2.U KO U B
> WT3.U WT U 3
> KO3.U KO U C
> WT1.LPS WT LPS 1
> KO1.LPS KO LPS A
> WT2.LPS WT LPS 2
> KO2.LPS KO LPS B
> WT3.LPS WT LPS 3
> KO3.LPS KO LPS C
>
> The main contrasts that i want to analyze are:
> -the difference in basal levels of proliferating cells (KO.P-WT.P)
> - the difference between genes activated by LPS in KO vs WT:
> (KO.LPS-KO.U) - (WT.LPS -WT.U)
>
> I did an MDSplot on my samples and they are perfectly separated: the first
> dimension separates LPS vs. U or P; while the second dimension separates Wt
> vs. KO. Moreover, between WT and KO groups also U and P are separated along
> that dimension.
>
> So, now I have some statistical question when fitting a linear model with
> limma:
> 1)Is it correct to fit the linear model considering all the samples? Even
> though for example LPS treated sample are completely different from
> untreated?
They aren't completely different. They are the same monocytes that have
been subjected to a different environmental stressor. To answer your
question, yes it is correct.
> 2)Should I consider the experiment with proliferating cells (P) a separate
> one and fit an independent linear model (considering that I am not
> particularly interested in the differences U-P or LPS - P.
I would keep everything together. The difference is subtle, but
important. The denominator of your contrast is based on an overall
estimate of intra-group variability. If you have all three sample types
in the model, that increases the number of samples with which you
estimate this variability. Since the sample variance is a fairly
inefficient statistic (meaning it takes relatively more data before it
converges to the parameter it is intended to estimate), increasing the
amount of data used to estimate intra-group variability will tend to
improve power to detect differences.
> 3) Are the statistics influenced by the number of contrasts you investigate
> when calling the "contrasts.fit" and "eBayes" functions?
Not sure what you mean here. What statistics? Influenced how?
I'll take a stab at an answer, however. You might be worried about the
increase in multiplicity with increasing numbers of contrasts. This is
an issue to a certain extent, but can be ameliorated by adjusting for
multiplicity with e.g., FDR, which isn't affected by the number of
contrasts.
In other words, if you have two contrasts that are FDR adjusted and you
select genes with an adjusted p-value of 0.05, the entire set of genes
you select are estimated to contain at most 5% false positives. If you
increase the number of contrasts, the proportion of false positives
doesn't increase.
> 3) When calling lmFit should I "block" for fetal liver origin? (considering
> that I split each fetal liver in three plates forming P, U and LPS) even
> though cells do not cluster for fetal liver origin).
It's hard to say. In conventional statistics one tends to fit various
models and then chooses the one that explains the data 'the best' while
trying to be as parsimonious as possible. In microarray analyses we
don't have that luxury, as we are fitting thousands of models at the
same time, so each cannot be tailored specifically to each gene. In
other words, we fit a single model to all of the genes in a rather naive
manner rather than different models based on how the model fits a
particular gene.
So in general we often fit what are arguably over-parameterized models
because we don't want to take the chance that there are certain genes
for which we need all the parameters in our model. In other words, in
aggregate there doesn't appear to be a real need to block on mouse with
your data. This doesn't tell us if there are certain genes that really
do vary by source mouse, and would benefit from having the blocking
variables in the model.
This brings us to the topic of assumptions. You can assume that you
don't really need to block on mouse, based on the results of
duplicateCorrelations returning a low correlation. That isn't a horrible
assumption, and could be defended, especially since you will 'save' some
degrees of freedom by not blocking. Or you could think that maybe there
are some genes that really need the blocking variable and you are
willing to reduce your degrees of freedom. I think that is a reasonable
assumption as well.
But you wouldn't ever both block and add the correlation. You do one or
the other. Blocking requires fewer assumptions than fitting a mixed
model (which is what you do if you pass a correlation to lmFit). And if
you fit a blocking variable, you have to increase the number of rows of
your contrasts matrix to account for that (which is why you get the
error below). There are examples in the limma User's Guide that show how
to handle a blocking variable.
Best,
Jim
> To that purpose first I calculated the correlation between samples
> originating from the same fetal liver (with duplicateCorrelation) and it
> was "0.15", afterwards I used lmFit putting the block and correlation
> variables into the formula, but when I used "contrasts.fit" I had the
> following error:
> Error in contrasts.fit(fit2, contrasts2) :
> Number of rows of contrast matrix must match number of coefficients
> In addition: Warning message:
> In contrasts.fit(fit2, contrasts2) :
> row names of contrasts don't match col names of coefficients
>
> Sorry for all these questions, initially I treated my experiment as a 2x3
> factorial, after I conducted two separates analysis one for proliferating
> cells and another one as a 2x2 factorial, but now I am not sure on my
> statistics.
> I hope to have been clear.
> Ilario, PhD Student
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
More information about the Bioconductor
mailing list