[R] Analysis of poorly replicated array data

Mon Jul 14 22:23:08 CEST 2008

Hi Eli

> I have "inherited" a cDNA macroarray dataset that is structured as follows.
> Three different stressors were tested.  For each stressor, there are two
> treatments (control and stressed).  For each treatment, two biological
> replicates exist, and these are paired (i.e., there is a stressed array for
> colony A and a control array from this same colony).  For one of these
> samples, duplicate arrays were performed (technical replicates).  This works
> out to 18 different arrays corresponding to 12 independent biological
> replicates.  But counting only the biological replicates for each stressor,
> there are only n=2 stressed arrays and n=2 control arrays.
> 
> I am pretty well versed in the analysis of array data using R, but obviously
> this dataset presents a real challenge because of the low replication.  For
> logistical reasons, increasing the sample size is not a possibility.  My
> main goal here is to salvage whatever valid findings can be salvaged from
> the existing data, but I don't want to go too far in claiming significance
> for an expression pattern if there isn't really any statistical support for
> it.
> 
> My questions are:
> (1) Whether it is even possible to statistically compare the effects of
> these stressors on gene expression,
> (2) If so, what are folks' recommendations?
> (3) Obviously low sample size means low statistical power, but I have always
> been told that calculating variance for n=2 and doing stats on that basis is
> not even mathematically valid.  Can anyone confirm or refute this?

The sample variance (with divisor n-1) is an unbiased estimator of the 
variance for any n>1, so I am not sure what you mean by "mathematically 
valid". With n=2 the estimate is very imprecise, though. This is why 
"moderated t-statistics" [e.g. 1,2] are popular in microarray analysis: 
the variance estimate in the t-like statistic borrows information across 
genes, paying a small bias for a big gain in precision.
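
To make the trade-off concrete, here is a minimal simulation sketch in
Python (standard library only). It is an illustration, not the procedure
of [1]. The function names, the prior degrees of freedom d0 = 4 and the
choice of the average per-gene variance as the prior value are all
assumptions for this example; in [1] the prior parameters are estimated
from the data. Every simulated gene has the same true variance here, so
the shrinkage costs nothing; with real data the variances differ between
genes, and that is where the small bias comes from.

```python
import random
import statistics


def sample_variance(xs):
    """Unbiased sample variance (divisor n - 1); needs at least two values."""
    return statistics.variance(xs)


def moderated_variance(s2, d, s0_sq, d0):
    """Shrink a per-gene variance s2 (d degrees of freedom) towards a
    prior value s0_sq that carries d0 prior degrees of freedom."""
    return (d0 * s0_sq + d * s2) / (d0 + d)


def simulate(n_genes=20000, n=2, true_var=1.0, d0=4.0, seed=1):
    """Return per-gene raw and moderated variances for simulated data."""
    rng = random.Random(seed)
    sd = true_var ** 0.5
    # One variance estimate per gene, each from only n observations.
    raw = [sample_variance([rng.gauss(0.0, sd) for _ in range(n)])
           for _ in range(n_genes)]
    # Assumption: use the average per-gene variance as the prior value.
    s0_sq = statistics.fmean(raw)
    mod = [moderated_variance(s2, n - 1, s0_sq, d0) for s2 in raw]
    return raw, mod


raw, mod = simulate()
# The raw estimates average close to the true variance (unbiased) ...
print("mean of raw variances:    ", round(statistics.fmean(raw), 3))
# ... but any single estimate from n = 2 scatters widely around it.
print("sd of raw variances:      ", round(statistics.pstdev(raw), 3))
# Sharing information across genes reduces that scatter.
print("sd of moderated variances:", round(statistics.pstdev(mod), 3))
```

The moderated value is a weighted average of the gene's own variance and
the prior value, with weights d and d0, so a larger d0 means stronger
shrinkage towards the common value.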

However, you might want to bear in mind that the variability (and hence 
variance) that you observe between replicates within your experiment may 
be a poor representation of the variability that you would see if 
someone independently replicated the experiment. It would not be wise to 
expect statistics to somehow magically extrapolate from one to the 
other. So rather than aiming to "claim significance", you could aim to 
generate biological hypotheses that you then try to corroborate by 
other means (integrating data from other experiments, literature 
search, follow-up experiments).

[1] Linear Models and Empirical Bayes Methods for Assessing Differential 
Expression in Microarray Experiments, Gordon K. Smyth
http://www.bepress.com/sagmb/vol3/iss1/art3

[2] Differential Expression with the Bioconductor Project
Anja von Heydebreck, Wolfgang Huber, Robert Gentleman
http://www.bepress.com/bioconductor/paper7

Best wishes
	Wolfgang
-- 
----------------------------------------------------
Wolfgang Huber, EMBL-EBI, http://www.ebi.ac.uk/huber