[BioC] DESeq and number of replicates required for RNA-Seq

Wed Jun 16 14:54:59 CEST 2010

Mark is right.  If Y~Poisson then sqrt(Y) is approximately normal 
with variance 1/4.
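
For anyone who wants to check the 1/4 figure numerically, here is a minimal sketch that sums over the Poisson probabilities directly rather than simulating. The function name and the truncation point of the sum are my own choices for illustration, not anything taken from DESeq.

```python
from math import exp, lgamma, log, sqrt

def sqrt_poisson_variance(mean, tail_sds=12):
    """Variance of sqrt(Y) for Y ~ Poisson(mean), by summing over the pmf."""
    # Truncate the sum far enough into the upper tail that the rest is negligible.
    upper = int(mean + tail_sds * sqrt(mean)) + 20
    first = second = 0.0
    for k in range(upper + 1):
        pmf = exp(k * log(mean) - mean - lgamma(k + 1))
        first += pmf * sqrt(k)   # accumulates E[sqrt(Y)]
        second += pmf * k        # accumulates E[Y], which equals E[sqrt(Y)^2]
    return second - first ** 2

for mean in (1, 5, 20, 100, 1000):
    print(mean, round(sqrt_poisson_variance(mean), 4))
```

The printed variance settles near 0.25 once the mean count is moderately large, and drifts away from it for very low counts, which is why the approximation needs "enough expression".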

--Naomi

At 07:20 AM 6/15/2010, Mark Robinson wrote:
>Hi Mick.
>
>I can't speak for cufflinks, but the TMM normalization in that GB 
>paper is really about accounting for 'composition' biases.  So, this 
>can help when the samples have different RNA composition (or some 
>other systematic effect), but it seems to me like the "dirtiness" 
>you mention here is just that you have large biological 
>variation.   Genomics studies are generally underpowered anyways and 
>high biological variation, which is presumably a reality of your 
>experimental system, just makes detecting changes harder.
>
>Naomi:  I assume you meant sqrt(Yi), not log(Yi), for the normal 
>approximation to the Poisson?
>
>Cheers,
>Mark
>
>On 2010-06-15, at 4:44 PM, michael watson (IAH-C) wrote:
>
> > Thanks Naomi
> >
> > Yes, I have several RNA-Seq datasets that look like they may have
> > large biological variation.
> >
> > I feel this is the "dirty secret" of the new revolution that is
> > RNA-Seq - even with large numbers of replicates, the variation in
> > (and nature of) the read counts means we can only find genes that
> > are changing by a large amount.
> >
> > I wonder if some of the normalisation suggested by Robinson and
> > Oshlack will help (http://genomebiology.com/2010/11/3/R25).
> >
> > And of course there is cufflinks
> >
> > Thanks
> > Mick
> > ________________________________________
> > From: Naomi Altman [naomi at stat.psu.edu]
> > Sent: 15 June 2010 03:02
> > To: michael watson (IAH-C); Naomi Altman; bioconductor at stat.math.ethz.ch
> > Subject: Re: [BioC] DESeq and number of replicates required for RNA-Seq
> >
> > Hi Michael,
> > I was working this out for a lecture and here is what I found:
> >
> > If there is enough expression for the Normal approximation to hold
> > then here is a rule of thumb.
> >
> > Suppose that the total number of reads is identical for all samples
> > and that there is NO biological variation.  If Yi is the number of
> > reads for a gene in sample i, then
> > Poisson variation alone leads to log(Yi) approx normal with variance
> > 1/4.  (This is what the DESeq vignette calls "shot" variance.)
> >
> > Using the formula for a 2 sample t-test, you see that to detect
> > 2-fold differences (log2(2)=1) with 95% power at alpha =.05 you need
> > n > 32 var/(log2(fold))^2, which is approximately 8 biological reps
> > per treatment.
> >
> > However, that is for NO biological variation.  (Have a look at the
> > example in the DESeq vignette!) And it assumes alpha=.05 (but we are
> > going to use a much smaller alpha due to the multiple comparisons
> > adjustment).
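
The rule of thumb quoted above can be reproduced with the usual normal-approximation sample size formula for comparing two group means, n = 2 (z_alpha + z_power)^2 var / difference^2 per group. The sketch below plugs in the figures exactly as given in the message (variance 1/4, difference 1). The function name is made up for illustration, and the factor 32 presumably comes from rounding both normal quantiles up to 2.

```python
from math import ceil
from statistics import NormalDist

def reps_per_group(variance, difference, alpha=0.05, power=0.95):
    """Replicates per group for a two-sided, two-sample comparison of means
    (normal approximation, equal variance in both groups)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_power = z.inv_cdf(power)           # quantile matching the target power
    return 2 * (z_alpha + z_power) ** 2 * variance / difference ** 2

# Figures as quoted: shot-noise variance 1/4, difference of 1.
print(ceil(reps_per_group(variance=0.25, difference=1.0)))

# A stricter alpha, as a multiple comparisons adjustment would impose.
print(ceil(reps_per_group(variance=0.25, difference=1.0, alpha=0.001)))
```

With the exact quantiles the first figure comes out a little under the rule-of-thumb 8, and the second call shows how the requirement grows once alpha is tightened. Any biological variation would be added to the variance argument and push both numbers up further.
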
> >
> > --Naomi
> >
> >
> > At 12:57 PM 6/14/2010, michael watson (IAH-C) wrote:
> >> Hi Naomi
> >>
> >> Thanks for the reply.
> >>
> >> The issue isn't necessarily low expressing genes, but perhaps high
> >> expressing genes with a small (ish) fold change.  DESeq seems to
> >> only report as significant differences that are high fold changes.
> >>
> >> Contrast this to limma for microarrays, where small fold changes can
> >> be reported as significant.
> >>
> >> For whatever reason, the transcriptomic community have become
> >> fixated on "two-fold" as some kind of standard cut-off.  Now, I'm
> >> not fixated on that, but the example in DESeq reports 428
> >> significant genes with an estimated fold change at FDR 5%, however,
> >> NONE of these are in the range -2 : 2.  The minimum positive logFC
> >> is 2.18 (4.5 fold up-regulation), and the maximum negative logFC is
> >> -2.49 (5.65 fold down-regulation).
> >>
> >> So what I am concerned about is finding genes, either highly or
> >> lowly expressed, that are differing by a small fold change - say two-fold.
> >>
> >> Thanks
> >> Mick
> >> ________________________________________
> >> From: Naomi Altman [naomi at stat.psu.edu]
> >> Sent: 14 June 2010 17:42
> >> To: michael watson (IAH-C); bioconductor at stat.math.ethz.ch
> >> Subject: Re: [BioC] DESeq and number of replicates required for RNA-Seq
> >>
> >> The issue is a mix of expression level and sample size.  For count
> >> data, the power is higher when the expression is higher.  Also, the
> >> p-values are discrete - the lower the total read count, the fewer
> >> values are possible, which messes up the FDR estimation.
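
To make the discreteness point concrete, here is a small sketch that lists every p-value a gene can possibly get when its reads are split between two samples. It uses a two-sided exact binomial test with equal expected shares, conditional on the total count. That test is an assumption chosen for illustration and is not claimed to be what DESeq does; the function name is likewise made up.

```python
from math import comb

def attainable_p_values(total):
    """All two-sided exact binomial p-values (p = 0.5) for `total` reads
    split between two samples."""
    pmf = [comb(total, k) * 0.5 ** total for k in range(total + 1)]
    p_values = set()
    for k in range(total + 1):
        # Add up the outcomes that are no more likely than the observed split.
        p = sum(q for q in pmf if q <= pmf[k] * (1 + 1e-9))
        p_values.add(round(min(p, 1.0), 12))
    return sorted(p_values)

for total in (5, 10, 50):
    ps = attainable_p_values(total)
    print(total, len(ps), ps[0])   # total reads, distinct p-values, smallest one
```

With 5 reads in total there are only three attainable p-values and the smallest is 0.0625, so such a gene can never reach p < 0.05 however extreme the split.
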
> >>
> >> Of course, understanding the problem does not necessarily suggest a
> >> solution.  But sample sizes will need to be large (or you need to
> >> sequence very deeply) if you want to detect differential expression
> >> in low expressing genes.
> >>
> >> --Naomi
> >>
> >> At 09:45 AM 6/14/2010, michael watson (IAH-C) wrote:
> >>> Hi
> >>>
> >>> This follows on slightly from my experimental design thread.
> >>>
> >>> Having worked through the vignette for DESeq, it seems to work
> >>> well.  However, for the TagSeqExample.tab data set, when using an
> >>> FDR cut off of 0.05, what we see is that we only find differential
> >>> expression for large fold changes - an average of log2 fold change
> >>> of 5 for up-regulated, and log2 fold change of -5 for
> >>> down-regulated.  There are very few significant results that even go
> >>> as far down as 2 or -2 - which is still a 4-fold change.
> >>>
> >>> So, the question is, how many replicates must we have to get more
> >>> sensitive results?  Say down to log2FC of 1 (two-fold up- or
> >>> down-regulated)?
> >>>
> >>> I can calculate this by using DESeq's own estimates of variance to
> >>> approximate replicates for T and N in the example data, and keep
> >>> going until my significant results start to hit a logFC of 1, but I
> >>> wanted to know if anyone else had done this yet?
> >>>
> >>> Thanks
> >>> Mick
> >>>
> >>> _______________________________________________
> >>> Bioconductor mailing list
> >>> Bioconductor at stat.math.ethz.ch
> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>> Search the archives:
> >>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>
> >> Naomi S. Altman                                814-865-3791 (voice)
> >> Associate Professor
> >> Dept. of Statistics                              814-863-7114 (fax)
> >> Penn State University                         814-865-1348 (Statistics)
> >> University Park, PA 16802-2111
>
>------------------------------
>Mark Robinson, PhD (Melb)
>Epigenetics Laboratory, Garvan
>Bioinformatics Division, WEHI
>e: m.robinson at garvan.org.au
>e: mrobinson at wehi.edu.au
>p: +61 (0)3 9345 2628
>f: +61 (0)3 9347 0852
>------------------------------

Naomi S. Altman                                814-865-3791 (voice)
Associate Professor
Dept. of Statistics                              814-863-7114 (fax)
Penn State University                         814-865-1348 (Statistics)
University Park, PA 16802-2111