[BioC] a few questions on DESeq
Simon Anders
anders at embl.de
Thu Nov 11 11:44:52 CET 2010
Dear Sunghee
On 11/11/2010 05:44 AM, sunghee OH wrote:
> In DESeq, if one data set has both technical and biological replicates, i am
> curious how technical and biological replicates are distinguished in that
> package. for instance, say, each biological replicate has two technical
> replicates. T_b_1_t_1(cancer biological replicate 1, technical replicate1)
> T_b_1_t_2, T_b_2_t_1, T_b_2_t_2 VS similarly Normal samples.
>
> So. the main null hypothesis is no difference between tumor vs normal
> group.(4 samples vs 4)
>
> For such a case, when analyzing 8 samples together with the above
> hypothesis,
>
> Q1. for "conds" variable indicating the labels of samples mentioned in DESeq
> manual, how do i have to define to distinguish tech vs biological
> replicates?(edgeR can not do for such a data, i know that it is able to
> treat only data with technical replicates rather than biological individuals
> instead, yet, it seems DESeq is able to handle data set with biological
> replicates without any additional variable of biological individual factor
> as like GLM. In the paper, one of published data sets for such kind of type
> data which has both tech and bio replicates was also analyzed: RNA-Seq of
> yeast as real data application using DESeq)
The short answer is: Just add up the counts from your technical
replicates, and treat them as one biological sample.
If you want a justification for this, I have to elaborate quite a bit.
Hence, I take the opportunity to write a little treatise on noise in
RNA-Seq, as this is something I wanted to write down anyway.
Let's first fix some terminology. I divide the noise (or: variance) in
our sample into three parts, shot noise, technical noise and biological
noise.
Shot noise: Assume, you prepare a library including amplification etc
and fill the prepared library in two lanes of a Solexa flowcell. Then,
we may assume that the concentration of each transcript is exactly the
same in the two lanes. However, we will nevertheless get different
counts for the transcript from the two lanes because high-throughput
sequencing is (as is all shotgun sequencing) a probabilistic process:
each cDNA molecule has a certain probability to be read by the
sequencer, and hence, the number of observed reads follows a Poisson
distribution. The variance of a Poisson distribution is equal to its
mean. From this, we can conclude that the theoretical minimum to the
variance in an HTS experiment (i.e., the noise occurring even when
comparing perfectly identical prepared libraries) is equal to the
expected count rate.
The actual variance is larger than the shot noise, as two further noise
sources need to be taken into account.
Technical noise: By this, I mean the amount of variance in addition to
shot noise that we observe when comparing two technical replicates,
i.e., two libraries prepared from the same sample. This noise captures
the non-uniformity of the mRNA extraction and library preparation
processes (and any noise in the sequencing process itself beyond the
shot noise).
Biological noise: This is variation that you observe in addition to shot
noise and technical noise, when comparing two biological replicates,
i.e., libraries prepared from two samples which were treated in the same
way. It reflects how two different samples which the experimenter
attempts to treat identical, nevertheless differ in their response to
environmental circumstances that the experimenter could not control for.
Hence, the total variance that one can observe from comparing two
samples is:
Total variance = Shot noise + technical noise + biological noise
If you compare technical replicates, the biological noise is zero, of
course.
The first thing to note in this decomposition is that we know the amount
of shot noise a priori from theoretical grounds (it is well estimated as
equal to the observed count rate), while the other two have to be
estimated from the data, as they reflect properties of the specific
experiment.
The crucial point is that technical noise is usually _much_ smaller than
shot noise and biological noise, so small, in fact, that it is not worth
treating it separately, rather than just lumping it together with either
the shot noise or the biological noise. This fact is the core result of
Marioni et al.'s 2008 Genome Res. paper and was also described by
Nagalakshmi et al. in their 2008 Science paper.
This holds only, of course, if everything went well in library
preparation. So, if you have technical replicates, you should take the
opportunity to check this. To do so with DESeq, set the 'conditions'
factor of your CountDataSet such that each sample in a pair of technical
replicates gets the same levels, but biological replicates get different
levels, i.e., let DESeq know which samples are technical replicates but
let it treat biological replicates as different conditions. Then,
estimate the size factors and variance functions and call 'scvPlot'. If
technical noise is really negligible compared to shot noise, the
distance of the solid lines (total variance minus shot noise) to the x
axis should appear very small when compared to the distance between the
dashed and the solid line (which is the shot noise), at least in the
count rate region where most genes are (as indicated by the black line).
If it does not look like this, you may have issues with your sample prep.
Once you have convinced yourself that the technical replicates agree as
well as the shot noise predicts, there is no longer any point in keeping
them separate. Just add up the counts from all technical replicates of a
given biological replicate to form a single column in the count table
and make a new CountDataSet, with the 'conditions' factor now describing
the biological conditions of the samples. If you now look at the SCV
plot, the distance between solid lines and x axis indicates the sum of
biological and technical noise. If technical noise has before been
established as negligible, you can read it as only showing the
biological variation. A good way to interpret it, is, by the way, to
take the square root of the value and understand it as the coefficient
of variation of gene expression over biological replicates.
The distance between solid and dashed line is the shot noise, as before.
If your technical noise is not negligible, one might argue that you need
a hierarchical noise model to get optimal power. To my knowledge, nobody
has implemented something like this yet for RNA-Seq. Furthermore, I
think, it would still be permissible to add up the technical replicates
because the technical noise will still be apparent in the sum and hence
properly accounted for once you estimate total variance between the
summed-up biological replicates.
Hence, my initial short answer: Just sum up the count from technical
replicates.
This was a long mail, maybe even too long to be read by anyone, so I
stop here and reply to your other two questions later today.
Cheers
Simon
+---
| Dr. Simon Anders, Dipl.-Phys.
| European Molecular Biology Laboratory (EMBL), Heidelberg
| office phone +49-6221-387-8632
| preferred (permanent) e-mail: sanders at fs.tum.de
More information about the Bioconductor
mailing list