[BioC] RNAseq expression analysis using DESeq: technical replicates, paired samples

Tue Nov 8 09:51:44 CET 2011

Hi Michael

On 2011-11-07 23:44, Michael Muratet wrote:
> No, I really was just checking for updates in how to arrange the input.
> I understand the limitations of no biological replicates. However, as
> noted in section 6 of the vignette, '...such experiments are often
> undertaken...' and we hope to confirm what we find by other means.
>
> I thought I had seen documentation somewhere on how to incorporate
> technical replicates into deseq, but I could not put my hands back on
> it, so I thought I'd ask. I have actually processed the original data
> once and got what results we could get. I don't expect the tech
> replicates to change the answer.

Sorry if I jumped at you a bit in my answer. I have just been asked this 
question a bit too often recently, and maybe I should not write emails 
late in the evening.

In my old post of last year, I referred to the papers of Nagalakshmi et 
al. and Marioni et al., who noticed that the variation between technical 
replicates follows a Poisson distribution. Now, if you add up two 
independent Poisson-distributed random variables, the sum is 
Poisson-distributed, too; and this is why any method looking at the 
technical replicates separately will not have more useful information to 
work with than one that is given the sum.
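If you want to convince yourself of this, here is a minimal simulation sketch in base R. The expected counts of 30 and 50 are made-up values, standing for the same library sequenced at two different depths:

```r
# Two technical replicates of one gene, sequenced to different depths.
# The expected counts are made-up values for illustration only.
set.seed( 1 )
n       <- 100000   # number of simulated draws
lambda1 <- 30       # expected count in technical replicate 1
lambda2 <- 50       # expected count in technical replicate 2

rep1  <- rpois( n, lambda1 )
rep2  <- rpois( n, lambda2 )
total <- rep1 + rep2

# For a Poisson variable, the variance equals the mean. The sum should
# hence show mean and variance both close to lambda1 + lambda2.
c( mean = mean( total ), var = var( total ) )
```

Mean and variance of the sum should both come out close to 80, i.e., the sum behaves like a single Poisson sample of greater depth.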

Even if you have overdispersion (i.e., variance in excess of Poisson) 
between technical replicates, summing them up will smooth this out a 
bit, and what is left will simply become part of the variance seen 
between the biological replicates (i.e., the sums of the technical 
ones). So if you have a layered experiment with both technical and 
biological replicates, I do not think it is worth the effort to account 
for this hierarchy in the analysis.
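The smoothing effect can be sketched with a small base R simulation. I assume here, purely for illustration, that the technical replicates are independent negative-binomial draws with a common mean and a technical dispersion of 0.04; in that setting the dispersion of the sum of k replicates should be about 0.04 / k:

```r
# Overdispersed technical replicates, simulated as negative binomial.
# All numbers are hypothetical choices for illustration only.
set.seed( 2 )
n     <- 100000   # number of simulated genes
mu    <- 100      # expected count per technical replicate
alpha <- 0.04     # assumed technical dispersion (variance = mu + alpha * mu^2)
k     <- 4        # number of technical replicates that get summed

reps  <- matrix( rnbinom( n * k, mu = mu, size = 1 / alpha ), ncol = k )
total <- rowSums( reps )

# Method-of-moments estimate of the dispersion: (variance - mean) / mean^2
disp_hat <- function( x ) ( var( x ) - mean( x ) ) / mean( x )^2

c( single = disp_hat( reps[ , 1 ] ), summed = disp_hat( total ) )
```

Under these assumptions, the single replicate should show a dispersion near 0.04 and the sum one near 0.01.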

Your situation seems to be that you have no biological replicates but 
expect to see only so few good hits that you can follow them up with 
independent, replicated verification. For such cases, I recently 
wondered whether the following trick might help:

Make your best guess for a biological coefficient of variation. Let us 
say we expect, from experience with previous experiments, that the gene 
expression between replicates, if you had them, would typically vary by 
15% (i.e., a coefficient of variation of 0.15). The dispersion is the 
square of this value, and you can inject it into your DESeq calculation:

   # Start as usual:
   library( DESeq )

   # An example data set with only two samples:
   cds <- makeExampleCountDataSet()[,2:3]

   # Size factors are needed before dispersions can be estimated:
   cds <- estimateSizeFactors( cds )

   # Estimate the dispersion, only to throw it away in the next step
   cds <- estimateDispersions( cds, method="blind",
      sharingMode="fit-only" )

   # Overwrite the estimates with our guessed fixed value:
   fData(cds)$disp_blind <- .15^2

   # See what you get
   res <- nbinomTest( cds, "A", "B" )

If you have some hunch what the dispersion might be, you can get some 
preliminary results that way. Do not take the p values too literally, of 
course, as they are based on guesswork. Observe how the ranking of the 
genes and the number of hits changes if you try different dispersion 
guesses. This helps you anticipate how much success you can hope for in 
your subsequent effort of verification, and for how many genes you 
should attempt verification.
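Trying several guesses could look roughly like the following sketch. It uses only the DESeq calls from the example above, so it needs the DESeq package installed; the list of guessed values and the 10% FDR cut-off are arbitrary choices of mine for illustration, not recommendations:

```r
library( DESeq )

# Same two-sample example data as above
cds <- makeExampleCountDataSet()[,2:3]
cds <- estimateSizeFactors( cds )
cds <- estimateDispersions( cds, method="blind",
   sharingMode="fit-only" )

# Guessed coefficients of variation to try (arbitrary example values)
cv_guesses <- c( .10, .15, .25, .35 )

# For each guess, overwrite the dispersions with the squared CV, rerun
# the test, and count the genes with adjusted p value below 10%
hits <- sapply( cv_guesses, function( cv ) {
   cds_guess <- cds
   fData(cds_guess)$disp_blind <- cv^2
   res <- nbinomTest( cds_guess, "A", "B" )
   sum( res$padj < .1, na.rm=TRUE )
} )
names( hits ) <- paste0( "CV=", cv_guesses )

hits
```

If the number of hits collapses as soon as you move the guess up a little, that is a hint to be modest about how many genes to take into verification.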

Some _very_ _rough_ (and maybe completely wrong) values for coefficients 
of variation, as I have seen them in the past:
- Comparison between isogenic liquid yeast cell cultures, highly 
controlled experiment: Below 10% or even 5% (dispersion .1^2 or .05^2) 
if you are lucky
- Comparison between isogenic mammalian cell cultures: maybe around 15% 
if things go well
- Comparison between tissues derived from isogenic litter mates: maybe 
20% to 35%
- Comparison between different non-related individuals: 50% or even much 
more
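To turn such guesses into dispersion values, you just square them. The following base R sketch does this for the rough figures above and also shows, for a hypothetical gene with an expected count of 100, the total coefficient of variation you would get once the Poisson (shot) noise is added, using variance = mean + dispersion * mean^2. The labels and the expected count are only my choices for illustration:

```r
# Rough biological CVs from the list above (labels are just for display)
cv <- c( yeast_very_good    = .05,
         yeast              = .10,
         mammalian_cells    = .15,
         litter_mates_low   = .20,
         litter_mates_high  = .35,
         unrelated          = .50 )

# The dispersion is the squared coefficient of variation
disp <- cv^2

# Negative-binomial variance for a hypothetical expected count of 100:
# shot noise (mu) plus biological noise (disp * mu^2)
mu        <- 100
total_var <- mu + disp * mu^2

tab <- data.frame( cv = cv, dispersion = disp,
   total_cv_at_100 = sqrt( total_var ) / mu )
tab
```

For weakly expressed genes the shot noise term dominates, so the guessed dispersion matters most for the strongly expressed ones.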

Best regards
   Simon