[BioC] a few questions on DESeq

Thu Nov 11 11:44:52 CET 2010

Dear Sunghee

On 11/11/2010 05:44 AM, sunghee OH wrote:
> In DESeq, if one data set has both technical and biological replicates, i am
> curious how technical and biological replicates are distinguished in that
> package. for instance, say, each biological replicate has two technical
> replicates. T_b_1_t_1(cancer biological replicate 1, technical replicate1)
> T_b_1_t_2,  T_b_2_t_1,  T_b_2_t_2  VS similarly Normal samples.
 >
> So. the main null hypothesis is no difference between tumor vs normal
> group.(4 samples vs 4)
>
> For such a case, when analyzing 8 samples together with the above
> hypothesis,
>
> Q1. for "conds" variable indicating the labels of samples mentioned in DESeq
> manual, how do i have to define to distinguish tech vs biological
> replicates?(edgeR can not do for such a data, i know that it is able to
> treat only data with technical replicates rather than biological individuals
> instead, yet, it seems DESeq is able to handle data set with biological
> replicates without any additional variable of biological individual factor
> as like GLM. In the paper, one of published data sets for such kind of type
> data which has both tech and bio replicates was also analyzed: RNA-Seq of
> yeast as real data application using DESeq)

The short answer is: Just add up the counts from your technical 
replicates, and treat them as one biological sample.

If you want a justification for this, I have to elaborate quite a bit. 
Hence, I take the opportunity to write a little treatise on noise in 
RNA-Seq, as this is something I wanted to write down anyway.

Let's first fix some terminology. I divide the noise (or: variance) in 
our sample into three parts, shot noise, technical noise and biological 
noise.

Shot noise: Assume, you prepare a library including amplification etc 
and fill the prepared library in two lanes of a Solexa flowcell. Then, 
we may assume that the concentration of each transcript is exactly the 
same in the two lanes. However, we will nevertheless get different 
counts for the transcript from the two lanes because high-throughput 
sequencing is (as is all shotgun sequencing) a probabilistic process: 
each cDNA molecule has a certain probability to be read by the 
sequencer, and hence, the number of observed reads follows a Poisson 
distribution. The variance of a Poisson distribution is equal to its 
mean. From this, we can conclude that the theoretical minimum to the 
variance in an HTS experiment (i.e., the noise occurring even when 
comparing perfectly identical prepared libraries) is equal to the 
expected count rate.

The actual variance is larger than the shot noise, as two further noise 
sources need to be taken into account.

Technical noise: By this, I mean the amount of variance in addition to 
shot noise that we observe when comparing two technical replicates, 
i.e., two libraries prepared from the same sample. This noise captures 
the non-uniformity of the mRNA extraction and library preparation 
processes (and any noise in the sequencing process itself beyond the 
shot noise).

Biological noise: This is variation that you observe in addition to shot 
noise and technical noise, when comparing two biological replicates, 
i.e., libraries prepared from two samples which were treated in the same 
way. It reflects how two different samples which the experimenter 
attempts to treat identical, nevertheless differ in their response to 
environmental circumstances that the experimenter could not control for.

Hence, the total variance that one can observe from comparing two 
samples is:

Total variance = Shot noise + technical noise + biological noise

If you compare technical replicates, the biological noise is zero, of 
course.

The first thing to note in this decomposition is that we know the amount 
of shot noise a priori from theoretical grounds (it is well estimated as 
equal to the observed count rate), while the other two have to be 
estimated from the data, as they reflect properties of the specific 
experiment.

The crucial point is that technical noise is usually _much_ smaller than 
shot noise and biological noise, so small, in fact, that it is not worth 
treating it separately, rather than just lumping it together with either 
the shot noise or the biological noise. This fact is the core result of 
Marioni et al.'s 2008 Genome Res. paper and was also described by 
Nagalakshmi et al. in their 2008 Science paper.

This holds only, of course, if everything went well in library 
preparation. So, if you have technical replicates, you should take the 
opportunity to check this. To do so with DESeq, set the 'conditions' 
factor of your CountDataSet such that each sample in a pair of technical 
replicates gets the same levels, but biological replicates get different 
levels, i.e., let DESeq know which samples are technical replicates but 
let it treat biological replicates as different conditions. Then, 
estimate the size factors and variance functions and call 'scvPlot'. If 
technical noise is really negligible compared to shot noise, the 
distance of the solid lines (total variance minus shot noise) to the x 
axis should appear very small when compared to the distance between the 
dashed and the solid line (which is the shot noise), at least in the 
count rate region where most genes are (as indicated by the black line). 
If it does not look like this, you may have issues with your sample prep.

Once you have convinced yourself that the technical replicates agree as 
well as the shot noise predicts, there is no longer any point in keeping 
them separate. Just add up the counts from all technical replicates of a 
given biological replicate to form a single column in the count table 
and make a new CountDataSet, with the 'conditions' factor now describing 
the biological conditions of the samples. If you now look at the SCV 
plot, the distance between solid lines and x axis indicates the sum of 
biological and technical noise. If technical noise has before been 
established as negligible, you can read it as only showing the 
biological variation. A good way to interpret it, is, by the way, to 
take the square root of the value and understand it as the coefficient 
of variation of gene expression over biological replicates.
The distance between solid and dashed line is the shot noise, as before.

If your technical noise is not negligible, one might argue that you need 
a hierarchical noise model to get optimal power. To my knowledge, nobody 
has implemented something like this yet for RNA-Seq. Furthermore, I 
think, it would still be permissible to add up the technical replicates 
because the technical noise will still be apparent in the sum and hence 
properly accounted for once you estimate total variance between the 
summed-up biological replicates.

Hence, my initial short answer: Just sum up the count from technical 
replicates.

This was a long mail, maybe even too long to be read by anyone, so I 
stop here and reply to your other two questions later today.

Cheers
   Simon

+---
| Dr. Simon Anders, Dipl.-Phys.
| European Molecular Biology Laboratory (EMBL), Heidelberg
| office phone +49-6221-387-8632
| preferred (permanent) e-mail: sanders at fs.tum.de