[BioC] DESeq and length bias [was: to know about the reason in results obtained using DESeq and cufflinks]
Simon Anders
anders at embl.de
Mon Aug 30 16:24:00 CEST 2010
Hi
On 08/30/2010 03:03 PM, Aniket Vatsya wrote:
> Also does DESeq overcome length bias, a general problem in RNA seq data
> analysis?
I don't quite agree with the term "length bias" as it is not really a
bias in the differential expression analysis.
In RNA-Seq, the number of reads mapped to a gene determines the power
you have to detect differential expression. See Fig. 2 of our preprint
(http://precedings.nature.com/documents/4282/version/2) for an
illustration. For the example data used in this figure, differential
expression (at 10% FDR) can be detected at a log2 fold change of around
0.5 or more when the count values are very high. With only around 100
counts, the log2 fold change needs to be at least 1, and with 10 counts,
at least 2.
Hence, the power to detect differential expression depends strongly on
the count, and the count in turn depends on two things, namely (i) the
expression strength (say, averaged over both conditions) and (ii) the
gene length (because longer genes give rise to more fragments at the
same expression level).
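To make the dependence on count concrete, here is a minimal sketch and
not what DESeq does internally. It assumes plain Poisson counts with no
overdispersion, one sample per condition, and an exact conditional
binomial test. The helper names and the numbers (20 vs. 40 fragments per
kb, gene lengths of 500 bp and 5000 bp) are invented for illustration.
Both genes have the same log2 fold change of 1; only the length, and
hence the count, differs.

```python
from math import comb

def expected_count(expr_per_kb, length_bp):
    """Toy model: count is proportional to expression strength times
    gene length, rounded to whole reads."""
    return round(expr_per_kb * length_bp / 1000)

def two_sided_p(k_a, k_b):
    """Exact conditional test for two Poisson counts (an assumption,
    not the DESeq test): under the null hypothesis of equal expression,
    k_a ~ Binomial(k_a + k_b, 1/2)."""
    n = k_a + k_b
    lo = min(k_a, k_b)
    # The distribution is symmetric, so double the lower tail.
    tail = sum(comb(n, k) for k in range(lo + 1))
    return min(1.0, 2 * tail / 2 ** n)

# Same expression in both genes: 20 per kb in condition A, 40 in B.
short_a, short_b = expected_count(20, 500), expected_count(40, 500)
long_a, long_b = expected_count(20, 5000), expected_count(40, 5000)

p_short = two_sided_p(short_a, short_b)  # counts 10 vs. 20
p_long = two_sided_p(long_a, long_b)     # counts 100 vs. 200

print("short gene:", short_a, short_b, p_short)
print("long gene: ", long_a, long_b, p_long)
```

The long gene gives a far smaller p value than the short one even
though the fold change is identical, which is the effect described
above.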
In a subsequent analysis looking, e.g., for enrichment in gene
categories, this causes bias. However, this bias should not and cannot
be dealt with by the method to test for differential expression. It
should, however, be taken into account by the enrichment test.
When adjusting such a test, I would suggest using the count level
directly as input, rather than the transcript length, as the latter is
only half of the story.
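One possible way to use the count level in an enrichment test, given
here only as a rough sketch under my own assumptions: bin genes by
their mean count, estimate the fraction of genes called differentially
expressed in each bin, and compare the number of hits observed in a
category with the number expected from those bin-specific fractions.
The function name, the binning scheme and the toy data are all
hypothetical, and a real test would also need a proper significance
calculation.

```python
import math
from collections import defaultdict

def observed_and_expected(genes, category, bins_per_decade=2):
    """genes: dict name -> (mean_count, is_de); category: set of names.
    Returns (observed, expected) numbers of DE genes in the category,
    where 'expected' uses the DE rate of each gene's count bin."""
    def bin_of(count):
        # Hypothetical binning on a log10 scale of the mean count.
        return math.floor(math.log10(count + 1) * bins_per_decade)

    totals = defaultdict(int)
    hits = defaultdict(int)
    for count, is_de in genes.values():
        b = bin_of(count)
        totals[b] += 1
        hits[b] += bool(is_de)

    members = [g for g in category if g in genes]
    observed = sum(bool(genes[g][1]) for g in members)
    expected = sum(hits[bin_of(genes[g][0])] / totals[bin_of(genes[g][0])]
                   for g in members)
    return observed, expected

# Invented toy data: 10 low-count genes (1 called DE) and
# 10 high-count genes (6 called DE).
genes = {f"low{i}": (10, i < 1) for i in range(10)}
genes.update({f"high{i}": (1000, i < 6) for i in range(10)})

# A category consisting only of high-count genes, 3 of them DE.
category = {"high0", "high1", "high2", "high6", "high7"}

observed, expected = observed_and_expected(genes, category)
naive = len(category) * sum(d for _, d in genes.values()) / len(genes)
print(observed, expected, naive)
```

In this toy example the category looks enriched against the naive
expectation, which ignores counts, but not against the count-matched
one.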
Simon