I guess the issue I am trying to highlight is one that Naomi put into a very good statistical framework – with ZERO biological variability, one would still need 8 replicates per treatment to detect a significant two-fold difference.  Isn’t that slightly worrisome?  Don’t get me wrong, I love the potential of RNA-Seq, have some RNA-Seq data and plan to produce a hell of a lot more, but the analysis side seems to lack the power to detect small changes.

Naomi Altman said:

“Suppose that the total number of reads is identical for all samples

and that there is NO biological variation.  If Yi is the number of

reads for a gene in sample i, then

Poisson variation alone leads to log(Yi) approx normal with variance

1/4.  (This is what the DESeq vignette calls "shot" variance.)



Using the formula for a 2 sample t-test, you see that to detect

2-fold differences (Log2(2)=1) with 95% power at alpha =.05 you need

n>32 var/log(fold) which is approximately 8 biological reps per treatment.”





From: seandavi@gmail.com [mailto:seandavi@gmail.com] On Behalf Of Sean Davis
Sent: 15 June 2010 17:06
To: michael watson (IAH-C)
Cc: Cei Abreu-Goodger; bioc list
Subject: Re: [BioC] DESeq and number of replicates required for RNA-Seq


On Tue, Jun 15, 2010 at 11:18 AM, michael watson (IAH-C) <michael.watson@bbsrc.ac.uk<mailto:michael.watson@bbsrc.ac.uk>> wrote:
Hi Cei

These are not my datasets, and I have heard something similar to you.

The history of this discussion is that I asked why DESeq, in the example data anaylsis, only found as significant genes whose fold change was very large (2^5 on average).  I thought this was odd as low fold changes can be significant.  Naomi replied and stated this was due to high biological variation.

That is certainly one possibility.  I think that she also suggested that both sequencing depth and sample numbers also contribute to the ability to find DE genes.  As a corollary to the sequencing depth issue, it will be true that the fold change defined as "significant" is a function of the sequencing depth, so for high-expressing genes, it is easier to find DE with smaller fold changes than with lower expressed genes.  In other words, a two-fold change may be meaningless for a large chunk of your genes of interest if the expression level/sequencing depth is not large enough while for high-expressing genes, a two-fold change might be highly significant.  While this was somewhat true for expression level in microarrays, the expression level and ability to detect differential expression with sequencing data takes on a new level of importance, perhaps.

Sean


Perhaps the data in the DESeq example are different to other RNA-Seq datasets - if anyone else has applied DESeq to other datasets and found significant genes with a low fold-change, I would be interested to hear it

Thanks
Mick

________________________________________
From: Cei Abreu-Goodger [cei@ebi.ac.uk<mailto:cei@ebi.ac.uk>]
Sent: 15 June 2010 16:12
To: michael watson (IAH-C)
Cc: bioc list
Subject: Re: [BioC] DESeq and number of replicates required for RNA-Seq

Hi Mick,

Do you know how much PCR amplification was used for your RNA-Seq
datasets? Could this be one factor leading to more apparent variation?

I'm surprised at your results though, I seem to remember many papers
with dot-plots comparing expression measurements between biological
replicates, which show that the RNA-Seq derived results are much tighter
(less variation) than the array based ones.

Cheers,

Cei

michael watson (IAH-C) wrote:
> Hi Mark
>
> Thanks for the reply.  I drafted and sent en e-mail on my phone which then lost connectivity (and the message) so apologies if anyone gets something similar twice!
>
> What worries me here is that we are looking at gene expression; we have been for years, using technologies such as RT-PCR, microarrays and SAGE.  Using those technologies, we have been able to detect small but significant changes.  Now, with RNA-Seq, we are seeing huge variation, and we simply state that due to this huge biological variation, we can only detect large changes.  Now, either these systems do actually contain huge biological variation, which means that many of our RT-PCR and array work was wrong (as these didn't detect such huge biological variation), or the variation we are seeing is not all biological and could be due to some bias we are not yet aware of.
>
> In fact, both SAGE and RT-PCR rely on count data of some sort, yet again neither showed up so much biological variation that they could only detect large changes.
>
> I suspect we haven't gotten to the bottom of RNA-Seq data just yet.
>
> Mick
>
> ________________________________________
> From: Mark Robinson [mrobinson@wehi.EDU.AU<mailto:mrobinson@wehi.EDU.AU>]
> Sent: 15 June 2010 12:20
> To: michael watson (IAH-C)
> Cc: bioc list; Naomi Altman
> Subject: Re: [BioC] DESeq and number of replicates required for RNA-Seq
>
> Hi Mick.
>
> I can't speak for cufflinks, but the TMM normalization in that GB paper is really about accounting for 'composition' biases.  So, this can help when the samples have different RNA composition (or some other systematic effect), but it seems to me like the "dirtiness" you mention here is just that you have large biological variation.   Genomics studies are generally underpowered anyways and high biological variation, which is presumably a reality of your experimental system, just makes detecting changes harder.
>
> Naomi:  I assume you meant sqrt(Yi), not log(Yi) for the normal approximation to the Possion ?
>
> Cheers,
> Mark
>
> On 2010-06-15, at 4:44 PM, michael watson (IAH-C) wrote:
>
>> Thanks Naomi
>>
>> Yes, I have several RNA-Seq datasets that look like they may have large biological variation.
>>
>> I feel this is the "dirty secret" of the new revolution that is RNA-Seq - even with large numbers of replicates, the variation in (and nature of) the read counts means we can only find genes that are changing by a large amount.
>>
>> I wonder if some of the normalisation suggested by Robinson and Oshlack will help (http://genomebiology.com/2010/11/3/R25).
>>
>> And of course there is cufflinks
>>
>> Thanks
>> Mick
>> ________________________________________
>> From: Naomi Altman [naomi@stat.psu.edu<mailto:naomi@stat.psu.edu>]
>> Sent: 15 June 2010 03:02
>> To: michael watson (IAH-C); Naomi Altman; bioconductor@stat.math.ethz.ch<mailto:bioconductor@stat.math.ethz.ch>
>> Subject: Re: [BioC] DESeq and number of replicates required for RNA-Seq
>>
>> Hi Michael,
>> I was working this out for a lecture and here is what I found:
>>
>> If there is enough expression for the Normal approximation to hold
>> then here is a rule of thumb.
>>
>> Suppose that the total number of reads is identical for all samples
>> and that there is NO biological variation.  If Yi is the number of
>> reads for a gene in sample i, then
>> Poisson variation alone leads to log(Yi) approx normal with variance
>> 1/4.  (This is what the DESeq vignette calls "shot" variance.)
>>
>> Using the formula for a 2 sample t-test, you see that to detect
>> 2-fold differences (Log2(2)=1) with 95% power at alpha =.05 you need
>> n>32 var/log(fold) which is approximately 8 biological reps per treatment.
>>
>> However, that is for NO biological variation.  (Have a look at the
>> example in the DESeq vignette!) And is assumes alpha=.05 (but we are
>> going to use a much smaller alpha due to the multiple comparisons
>> adjustment).
>>
>> --Naomi
>>
>>
>> At 12:57 PM 6/14/2010, michael watson (IAH-C) wrote:
>>> Hi Naomi
>>>
>>> Thanks for the reply.
>>>
>>> The issue isn't necessarily low expressing genes, but perhaps high
>>> expressing genes with a small (ish) fold change.  DESeq seems to
>>> only report as significant differences that are high fold changes.
>>>
>>> Contrast this to limma for microarrays, where small fold changes can
>>> be reported as significant.
>>>
>>> For whatever reason, the transcriptomic community have become
>>> fixated on "two-fold" as some kind of standard cut-off.  Now, I'm
>>> not fixated on that, but the example in DESeq reports 428
>>> significant genes with an estimated fold change at FDR 5%, however,
>>> NONE of these are in the range -2 : 2.  The minimum positive logFC
>>> is 2.18 (4.5 fold up-regulation), and the maximum negative logFC is
>>> 2.49 (5.65 fold down-regulation).
>>>
>>> So what I am concerned about is finding genes, either highly or
>>> lowly expressed, that are differing by a small fold change - say two-fold.
>>>
>>> Thanks
>>> Mick
>>> ________________________________________
>>> From: Naomi Altman [naomi@stat.psu.edu<mailto:naomi@stat.psu.edu>]
>>> Sent: 14 June 2010 17:42
>>> To: michael watson (IAH-C); bioconductor@stat.math.ethz.ch<mailto:bioconductor@stat.math.ethz.ch>
>>> Subject: Re: [BioC] DESeq and number of replicates required for RNA-Seq
>>>
>>> The issue is a mix of expression level and sample size.  For count
>>> data, the power is higher when the expression is higher.  Also, the
>>> p-values are discrete - the lower the total read count, the fewer
>>> values are possible, which messes up the FDR estimation.
>>>
>>> Of course, understanding the problem does not necessarily suggest a
>>> solution.  But sample sizes will need to be large (or you need to
>>> sequence very deeply) if you want to detect differential expression
>>> in low expressing genes.
>>>
>>> --Naomi
>>>
>>> At 09:45 AM 6/14/2010, michael watson (IAH-C) wrote:
>>>> Hi
>>>>
>>>> This follows on slightly from my experimental design thread.
>>>>
>>>> Having worked through the vignette for DESeq, it seems to work
>>>> well.  However, for the TagSeqExample.tab data set, when using an
>>>> FDR cut off of 0.05, what we see is that we only find differential
>>>> expression for large fold changes - an average of log2 fold change
>>>> of 5 for up-regulated, and log2 fold change of -5 for
>>>> down-regulated.  There are very few significant results that even go
>>>> as far down as 2 or -2 - which is still a 4-fold change.
>>>>
>>>> So, the question is, how many replicates must we have to get more
>>>> sensitive results?  Say down to log2FC of 1? (two-fold up or down
>>> regulated)?
>>>> I can calculate this by using DESeq's own estimates of variance to
>>>> approximate replicates for T and N in the example data, and keep
>>>> going until my significant results start to hit a logFC of 1, but I
>>>> wanted to know if anyone else had done this yet?
>>>>
>>>> Thanks
>>>> Mick
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor@stat.math.ethz.ch<mailto:Bioconductor@stat.math.ethz.ch>
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>> Naomi S. Altman                                814-865-3791 (voice)
>>> Associate Professor
>>> Dept. of Statistics                              814-863-7114 (fax)
>>> Penn State University                         814-865-1348 (Statistics)
>>> University Park, PA 16802-2111
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor@stat.math.ethz.ch<mailto:Bioconductor@stat.math.ethz.ch>
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> Naomi S. Altman                                814-865-3791 (voice)
>> Associate Professor
>> Dept. of Statistics                              814-863-7114 (fax)
>> Penn State University                         814-865-1348 (Statistics)
>> University Park, PA 16802-2111
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor@stat.math.ethz.ch<mailto:Bioconductor@stat.math.ethz.ch>
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> ------------------------------
> Mark Robinson, PhD (Melb)
> Epigenetics Laboratory, Garvan
> Bioinformatics Division, WEHI
> e: m.robinson@garvan.org.au<mailto:m.robinson@garvan.org.au>
> e: mrobinson@wehi.edu.au<mailto:mrobinson@wehi.edu.au>
> p: +61 (0)3 9345 2628
> f: +61 (0)3 9347 0852
> ------------------------------
>
>
>
>
>
>
> ______________________________________________________________________
> The information in this email is confidential and intend...{{dropped:6}}
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor@stat.math.ethz.ch<mailto:Bioconductor@stat.math.ethz.ch>
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
_______________________________________________
Bioconductor mailing list
Bioconductor@stat.math.ethz.ch<mailto:Bioconductor@stat.math.ethz.ch>
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


	[[alternative HTML version deleted]]