[BioC] RNAseq less sensitive than microarrays? Is it a statistical issue?

Simon Anders anders at embl.de
Tue May 21 21:47:36 CEST 2013


Hi Thomas

> Is it really so unreasonable to use this type of discrete
> raw expression values (sum of cov/feature) instead of read counts, and
> if so why?

First, the reverse question: What would be the advantage of using the 
coverage per feature over reads? Once you have decided which reads to 
use and where to map them, there is no real difference in difficulty 
between (a) counting how many reads overlap with a given feature and (b) 
adding up the numbers of bases of the feature that are overlapped by 
each read.

Of course, some people might find (b) easier to do than (a) because they 
happen to have a script for (b) lying around and not for (a), but it 
could just as well be the other way round, because writing a script for 
(a) is no more difficult than for (b). (And actually, neither is trivial: 
detecting and resolving ambiguities is not as easy as it sounds, 
especially if features overlap or if paired-end reads are involved.)

BTW: I assume (b) is what you mean by coverage. If not, correct me.

The value (b) may sound slightly nicer as it counts reads only 
fractionally if they overlap the feature only partially. I am not sure 
whether this is really an advantage, though: Conceptually, a read either 
stems from a given gene or it does not. It cannot be that only a part of 
the read derives from a gene, and the other part from some other gene.
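
To make the difference between (a) and (b) concrete, here is a minimal 
sketch in Python. The interval representation (half-open start/end 
pairs), the function names and the toy coordinates are all assumptions 
made for illustration; a real script would also have to handle strand, 
overlapping features and paired-end reads, which this one ignores.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Number of bases shared by two half-open intervals [start, end)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def read_count(feature, reads):
    """Quantity (a): how many reads overlap the feature at all."""
    return sum(1 for r in reads if overlap(*feature, *r) > 0)


def base_coverage(feature, reads):
    """Quantity (b): bases of the feature covered, summed over reads."""
    return sum(overlap(*feature, *r) for r in reads)


# Hypothetical toy data: one feature and four 50-base reads.
feature = (100, 200)
reads = [(90, 140), (120, 170), (180, 230), (300, 350)]

count = read_count(feature, reads)
coverage = base_coverage(feature, reads)
print("reads overlapping feature:", count)
print("summed overlapped bases:  ", coverage)
```

The two reads that hang over the feature's ends count fully towards (a) 
but only with their overlapping bases towards (b), which is the 
"fractional" counting discussed above.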

The advantage of (a) is that it counts "units of evidence". 
Specifically, we know that the variance of a read count is at least as 
large as the expected value of the count. This is because, conditioned 
on the feature's actual concentration in the sample, read counts are 
always Poisson distributed. Once you marginalize over the 
within-sample-group distribution of concentration, you get some kind of 
overdispersed Poisson, whose variance is strictly larger than the 
expectation.
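
The argument can be written out with the law of total variance. The 
notation is assumed here for illustration and is not taken from any 
particular package: K is the read count of a feature, q its 
concentration in the sample, and s a scaling factor (sequencing depth) 
such that the expected count given q is s q.

```latex
% Assumed notation: K = read count, q = concentration, s = scaling factor
\begin{align*}
K \mid q &\sim \mathrm{Poisson}(s q), \qquad \mu = \mathrm{E}[K] = s\,\mathrm{E}[q], \\
\mathrm{Var}(K) &= \mathrm{E}[\mathrm{Var}(K \mid q)] + \mathrm{Var}(\mathrm{E}[K \mid q]) \\
                &= s\,\mathrm{E}[q] + s^2\,\mathrm{Var}(q) \\
                &= \mu + s^2\,\mathrm{Var}(q) \;\ge\; \mu .
\end{align*}
```

Equality holds only if Var(q) = 0, i.e., if the concentration does not 
vary at all between the samples of a group.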

This gives you for free a lower bound on the variance, which is useful 
to improve specificity of inferential methods. If you do not count reads 
but something else, you do not get this automatic lower bound -- and 
this is the actual reason why so many methods work on read counts rather 
than coverage.
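
A small simulation can illustrate the lower bound. This is only a 
sketch under assumed settings: the gamma distribution for the 
between-sample variation of concentration, the expected count of 20, 
the shape parameter and the helper sample_poisson are all choices made 
for illustration, not something prescribed by any particular method.

```python
import math
import random
from statistics import mean, pvariance


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


rng = random.Random(1)   # fixed seed so the run is reproducible
n = 5000                 # number of simulated samples
expected = 20.0          # assumed expected read count

# Same concentration in every sample: counts are plain Poisson.
fixed = [sample_poisson(expected, rng) for _ in range(n)]

# Concentration varies between samples (gamma with mean `expected`):
# marginally, the counts are overdispersed.
shape = 4.0
mixed = [sample_poisson(rng.gammavariate(shape, expected / shape), rng)
         for _ in range(n)]

fixed_ratio = pvariance(fixed) / mean(fixed)
mixed_ratio = pvariance(mixed) / mean(mixed)
print("variance/mean, fixed concentration:  ", round(fixed_ratio, 2))
print("variance/mean, varying concentration:", round(mixed_ratio, 2))
```

The first ratio should come out close to 1 and the second clearly above 
1: the variance of a read count does not drop below its mean.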

   Simon


