[BioC] 2x3 factorial experiment

Wed May 29 17:50:37 CEST 2013

Hi Ilario,

On 5/28/2013 12:49 PM, Ilario De Toma wrote:
> I have performed a microarray experiment comparing primary monocyte from
> mouse fetal livers.
>
> I have three KO samples vs. three WT samples (biological replicates). Each
> sample have been splitted into three plates:
> - 1/3 (P) have been collected while proliferating (bacteria plate with
> GM-CSF)
> -1/3 (U) have been let untreated
> -1/3 (LPS) have been treated with LPS
>
> Here is a summary of my experiment:
>
>                       Status     treat   liver
> WT1.P             WT         U       1
> KO1.P             KO          U       A
> WT2.P             WT         U       2
> KO2.P             KO          U       B
> WT3.P             WT         U       3
> KO3.P              KO         U       C
> WT1.U             WT         U       1
> KO1.U             KO          U       A
> WT2.U             WT         U       2
> KO2.U              KO         U       B
> WT3.U             WT         U       3
> KO3.U             KO          U       C
> WT1.LPS         WT       LPS     1
> KO1.LPS         KO        LPS     A
> WT2.LPS         WT       LPS     2
> KO2.LPS         KO        LPS     B
> WT3.LPS        WT        LPS     3
> KO3.LPS         KO        LPS     C
>
> The main contrasts that i want to analyze are:
> -the difference in basal levels of proliferating cells (KO.P-WT.P)
> - the difference between  genes activated by LPS in KO vs WT:
> (KO.LPS-KO.U) - (WT.LPS -WT.U)
>
> I did an MDSplot on my samples and they are perfectly separated: the first
> dimension separates LPS vs. U or P; while the second dimension separates Wt
> vs. KO. Moreover, between WT and KO groups also U and P are separated along
> that dimension.
>
> So, now I have some statistical question when fitting a linear model with
> limma:
> 1)Is it correct to fit the linear model considering all the samples? Even
> though for example LPS treated sample are completely different from
> untreated?

They aren't completely different. They are the same monocytes that have 
been subjected to a different environmental stressor. To answer your 
question, yes it is correct.

> 2)Should I consider the experiment with proliferating cells (P) a separate
> one and fit an independent linear model (considering that I am not
> particularly interested in the differences U-P or LPS - P.

I would keep everything together. The difference is subtle, but 
important. The denominator of your contrast is based on an overall 
estimate of intra-group variability. If you have all three sample types 
in the model, that increases the number of samples with which you 
estimate this variability. Since the sample variance is a fairly 
inefficient statistic (meaning it takes relatively more data before it 
converges to the parameter it is intended to estimate), increasing the 
amount of data used to estimate intra-group variability will tend to 
improve power to detect differences.

> 3) Are the statistics influenced by the number of contrasts you investigate
> when calling the "contrasts.fit" and "eBayes" functions?

Not sure what you mean here. What statistics? Influenced how?

I'll take a stab at an answer, however. You might be worried about the 
increase in multiplicity with increasing numbers of contrasts. This is 
an issue to a certain extent, but can be ameliorated by adjusting for 
multiplicity with e.g., FDR, which isn't affected by the number of 
contrasts.

In other words, if you have two contrasts that are FDR adjusted and you 
select genes with an adjusted p-value of 0.05, the entire set of genes 
you select are estimated to contain at most 5% false positives. If you 
increase the number of contrasts, the proportion of false positives 
doesn't increase.
> 3) When calling lmFit should I "block" for fetal liver origin? (considering
> that I split each fetal liver in three plates forming P, U and LPS) even
> though cells do not cluster for fetal liver origin).

It's hard to say. In conventional statistics one tends to fit various 
models and then chooses the one that explains the data 'the best' while 
trying to be as parsimonious as possible. In microarray analyses we 
don't have that luxury, as we are fitting thousands of models at the 
same time, so each cannot be tailored specifically to each gene. In 
other words, we fit a single model to all of the genes in a rather naive 
manner rather than different models based on how the model fits a 
particular gene.

So in general we often fit what are arguably over-parameterized models 
because we don't want to take the chance that there are certain genes 
for which we need all the parameters in our model. In other words, in 
aggregate there doesn't appear to be a real need to block on mouse with 
your data. This doesn't tell us if there are certain genes that really 
do vary by source mouse, and would benefit from having the blocking 
variables in the model.

This brings us to the topic of assumptions. You can assume that you 
don't really need to block on mouse, based on the results of 
duplicateCorrelations returning a low correlation. That isn't a horrible 
assumption, and could be defended, especially since you will 'save' some 
degrees of freedom by not blocking. Or you could think that maybe there 
are some genes that really need the blocking variable and you are 
willing to reduce your degrees of freedom. I think that is a reasonable 
assumption as well.

But you wouldn't ever both block and add the correlation. You do one or 
the other. Blocking requires fewer assumptions than fitting a mixed 
model (which is what you do if you pass a correlation to lmFit). And if 
you fit a blocking variable, you have to increase the number of rows of 
your contrasts matrix to account for that (which is why you get the 
error below). There are examples in the limma User's Guide that show how 
to handle a blocking variable.

Best,

Jim

> To that purpose first I calculated the correlation between samples
> originating from the same fetal liver (with duplicateCorrelation) and it
> was "0.15", afterwards I used lmFit putting the block and correlation
> variables into the formula, but when I used "contrasts.fit" I had the
> following error:
> Error in contrasts.fit(fit2, contrasts2) :
>    Number of rows of contrast matrix must match number of coefficients
> In addition: Warning message:
> In contrasts.fit(fit2, contrasts2) :
>    row names of contrasts don't match col names of coefficients
>
> Sorry for all these questions, initially I treated my experiment as a 2x3
> factorial, after I conducted two separates analysis one for proliferating
> cells and another one as a 2x2 factorial, but now I am not sure on my
> statistics.
> I hope to have been clear.
> Ilario, PhD Student
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099