[R] Analysis of poorly replicated array data
Wolfgang Huber
huber at ebi.ac.uk
Mon Jul 14 22:23:08 CEST 2008
Hi Eli
> I have "inherited" a cDNA macroarray dataset that is structured as follows.
> Three different stressors were tested. For each stressor, there are two
> treatments (control and stressed). For each treatment, two biological
> replicates exist, and these are paired (i.e., there is a stressed array for
> colony A and a control array from this same colony). For one of these
> samples, duplicate arrays were performed (technical replicates). This works
> out to 18 different arrays corresponding to 12 independent biological
> replicates. But counting only the biological replicates for each stressor,
> there are only n=2 stressed arrays and n=2 control arrays.
>
> I am pretty well versed in the analysis of array data using R, but obviously
> this dataset presents a real challenge because of the low replication. For
> logistical reasons, increasing the sample size is not a possibility. My
> main goal here is to salvage whatever valid findings can be salvaged from
> the existing data, but I don't want to go too far in claiming significance
> for an expression pattern if there isn't really any statistical support for
> it.
>
> My questions are:
> (1) Whether it is even possible to statistically compare the effects of
> these stressors on gene expression,
> (2) If so, what are folks' recommendations?
> (3) Obviously low sample size means low statistical power, but I have always
> been told that calculating variance for n=2 and doing stats on that basis is
> not even mathematically valid. Can anyone confirm or refute this?
The sample variance (with the n-1 denominator) is an unbiased estimator
of the variance for any n>1, so I am not sure what you mean by
"mathematically valid". With n=2 it rests on a single degree of freedom,
though, so the estimate is very imprecise. This is why "moderated
t-statistics" [e.g. 1,2] are popular in microarray analysis: the
variance estimate in the t-like statistic borrows information across
genes, paying a small bias for a big gain in precision.
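To make the point about n=2 concrete, here is a minimal simulation
sketch. It is only an illustration under stated assumptions, not the
method implemented in any package: every simulated "gene" has the same
true variance of 1, the prior degrees of freedom d0=4 are an arbitrary
choice, and the prior variance is crudely taken as the mean of the raw
variances (software such as limma estimates both from the data). The
shrinkage formula is the posterior variance from [1],
(d0*s0^2 + d*s^2)/(d0 + d). The function names are hypothetical.

```python
import random
import statistics


def simulate_sample_variances(n_genes, n_reps, true_var, seed=0):
    """Per-gene sample variances (n-1 denominator) from n_reps normal draws."""
    rng = random.Random(seed)
    sd = true_var ** 0.5
    return [
        statistics.variance([rng.gauss(0.0, sd) for _ in range(n_reps)])
        for _ in range(n_genes)
    ]


def moderate(variances, d, d0, s0_sq):
    """Shrink each gene's variance towards a common prior value s0_sq.

    d  : residual degrees of freedom per gene (n - 1)
    d0 : prior degrees of freedom (an assumed value here, not estimated)
    """
    return [(d0 * s0_sq + d * s2) / (d0 + d) for s2 in variances]


# n=2 replicates per gene, so each variance has d = 1 degree of freedom.
raw = simulate_sample_variances(n_genes=5000, n_reps=2, true_var=1.0)

# Crude stand-in for an empirical Bayes estimate of the prior variance.
prior_var = statistics.mean(raw)
shrunk = moderate(raw, d=1, d0=4, s0_sq=prior_var)

print("mean of raw variances:  ", round(statistics.mean(raw), 3))
print("spread of raw variances:", round(statistics.stdev(raw), 3))
print("spread after shrinkage: ", round(statistics.stdev(shrunk), 3))
```

The mean of the raw variances should come out close to the true value
(unbiasedness), while their spread across genes is large. Shrinking
towards the common value reduces that spread by the factor d/(d0+d).
With real data the genes do not all share one true variance, which is
where the small bias comes in.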
However, you might want to bear in mind that the variability (and hence
variance) that you observe between replicates within your experiment may
be a poor representation of the variability that you would see if
someone independently replicated the experiment. It would not be wise to
expect statistics to somehow magically extrapolate from one to the
other. So rather than aiming to "claim significance" you could aim for
generating biological hypotheses that you could then try to corroborate
using other means (integrating data from other experiments, literature
search, follow-up experiments).
[1] Linear Models and Empirical Bayes Methods for Assessing Differential
Expression in Microarray Experiments, Gordon K. Smyth
http://www.bepress.com/sagmb/vol3/iss1/art3
[2] Differential Expression with the Bioconductor Project
Anja von Heydebreck, Wolfgang Huber, Robert Gentleman
http://www.bepress.com/bioconductor/paper7
Best wishes
Wolfgang
--
----------------------------------------------------
Wolfgang Huber, EMBL-EBI, http://www.ebi.ac.uk/huber