[BioC] DESeq and length bias [was: to know about the reason in results obtained using DESeq and cufflinks]

Simon Anders anders at embl.de
Mon Aug 30 16:24:00 CEST 2010


Hi

On 08/30/2010 03:03 PM, Aniket Vatsya wrote:
> Also does DESeq overcome length bias, a general problem in RNA seq data
> analysis?

I don't quite agree with the term "length bias" as it is not really a 
bias in the differential expression analysis.

In RNA-Seq, the number of reads mapped to a gene determines the power 
you have to detect differential expression. See Fig. 2 of our preprint 
(http://precedings.nature.com/documents/4282/version/2) for an 
illustration. For the example data used in this figure, and for very 
high count values, differential expression (at 10% FDR) can be detected 
if the log2 fold change is at least around 0.5. If you have only around 
100 counts, the log2 fold change needs to be at least 1, and for 10 
counts, at least 2.
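
To make the dependence of power on count concrete, here is a rough 
sketch in Python. It is not the test that DESeq performs; it is only a 
delta-method approximation for negative-binomial counts, in which the 
variance of the log count is taken as roughly 1/mean + dispersion. The 
function name and the values chosen for the dispersion, the z threshold 
and the number of replicates are assumptions for illustration, so the 
numbers will not reproduce those read off the figure. Only the trend 
matters: the detectable fold change shrinks as the count grows and then 
levels off at a floor set by the dispersion.

```python
import math

def min_detectable_log2fc(mean_count, dispersion=0.02, z=3.0, replicates=2):
    """Approximate smallest |log2 fold change| detectable at a given mean count.

    Assumption: per-sample Var(log count) ~ 1/mean_count + dispersion
    (delta method for a negative-binomial count). The default values for
    dispersion, z and replicates are illustrative, not taken from DESeq.
    """
    # Variance of the mean log count within one condition.
    var_per_group = (1.0 / mean_count + dispersion) / replicates
    # Standard error of the difference between two conditions (natural log).
    se_log_ratio = math.sqrt(2.0 * var_per_group)
    # Convert the threshold from natural log to log2.
    return z * se_log_ratio / math.log(2)

for n in (10, 100, 1000, 100000):
    print(n, round(min_detectable_log2fc(n), 2))
```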

Hence, the power to detect differential expression depends strongly on 
the count, and the count in turn depends on two things, namely (i) the 
expression strength (say, averaged over both conditions) and (ii) the 
gene length (because longer genes give rise to more fragments at the 
same expression level).
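
The two contributions to the count can be sketched as follows. This is 
a toy model that assumes the expected number of fragments from a gene 
is proportional to abundance times length; the function name, the gene 
names and all numbers are made up for illustration.

```python
def expected_counts(genes, total_fragments):
    """Distribute fragments among genes in proportion to abundance * length.

    genes: dict mapping gene name -> (relative_abundance, length_bp).
    Both the model and the inputs are illustrative assumptions.
    """
    weights = {g: ab * length for g, (ab, length) in genes.items()}
    total = sum(weights.values())
    return {g: total_fragments * w / total for g, w in weights.items()}

genes = {
    "short_gene": (1.0, 500),    # low expression, short
    "long_gene": (1.0, 5000),    # same expression, ten times longer
    "short_high": (10.0, 500),   # short, but ten times higher expression
}
counts = expected_counts(genes, total_fragments=1_000_000)
for name, c in counts.items():
    print(name, round(c))
```

In this toy model, the long gene and the short but highly expressed 
gene end up with the same expected count, and hence with the same 
power, even though their lengths differ tenfold.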

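In a subsequent analysis that looks, e.g., for enrichment in gene 
categories, this causes bias: categories containing many long or highly 
expressed genes show up as enriched more easily, simply because their 
genes are more likely to be called significant. However, this bias 
should not and cannot be dealt with by the method used to test for 
differential expression. It should instead be taken into account by the 
enrichment test.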

When adjusting such a test, I would suggest using the count level 
directly as input, rather than the transcript length, as the latter is 
only half of the story.
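
One possible way to do this, given only as a sketch and not as an 
established method, is a permutation test that shuffles the 
"differentially expressed" labels only among genes with similar counts. 
The function name, the input dictionaries and the choice of half-log10 
count bins are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict

def count_matched_enrichment(counts, is_de, in_category, n_perm=2000, seed=0):
    """Permutation p-value for over-representation of DE genes in a category.

    DE labels are shuffled only within strata of similar count, so a
    category is not called enriched merely because its genes have high
    counts. counts, is_de and in_category are hypothetical dicts keyed
    by gene id (mean count, bool, bool).
    """
    rng = random.Random(seed)

    # Group genes into strata by count (half-log10 bins, an arbitrary choice).
    strata = defaultdict(list)
    for gene, c in counts.items():
        strata[int(2 * math.log10(c + 1))].append(gene)

    # Observed statistic: number of DE genes inside the category.
    observed = sum(1 for g in counts if is_de[g] and in_category[g])

    hits = 0
    for _ in range(n_perm):
        null_stat = 0
        for genes in strata.values():
            labels = [is_de[g] for g in genes]
            rng.shuffle(labels)  # permute DE labels within this stratum only
            null_stat += sum(
                1 for g, de in zip(genes, labels) if de and in_category[g]
            )
        if null_stat >= observed:
            hits += 1

    # Add-one correction so that the p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

Stratifying by count rather than by length means that both expression 
strength and gene length are accounted for at once.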

   Simon


