[BioC] [Bioc] RNAseq less sensitive than microarrays? Is it a statistical issue?

Thomas Girke thomas.girke at ucr.edu
Tue May 21 22:16:56 CEST 2013


Thanks. Your explanation makes sense. I really had to bring this up
(perhaps I should have started a new email thread) since it seems to be
such a basic question, yet I didn't have a convincing answer to it.

Thanks Simon and others for taking the time to respond to this almost
"philosophical" question: "the meaning of read counting" :). I appreciate
it.

Thomas

On Tue, May 21, 2013 at 07:47:36PM +0000, Simon Anders wrote:
> Hi Thomas
> 
> > Is it really so unreasonable to use this type of discrete
> > raw expression values (sum of cov/feature) instead of read counts, and
> > if so why?
> 
> First, the reverse question: What would be the advantage of using the 
> coverage per feature over reads? Once you have decided which reads to 
> use and where to map them, there is no real difference in difficulty 
> between (a) counting how many reads overlap with a given feature and (b) 
> adding up the numbers of bases of the feature that are overlapped by 
> each read.
> 
> Of course, some people might find (b) easier to do than (a) because they 
> happen to have a script for (b) lying around and not for (a), but it 
> could just as well be the other way round, because writing a script is 
> no more difficult for (a) than for (b). (And actually, neither is 
> trivial: Detecting and resolving ambiguities is not as easy as it 
> sounds, especially if features overlap or if paired-end reads are 
> involved.)
> 
> BTW: I assume (b) is what you mean by coverage. If not, correct me.
> 
> The value (b) may sound slightly nicer as it counts reads only 
> fractionally if they overlap the feature only partially. I am not sure 
> whether this is really an advantage, though: Conceptually, a read either 
> stems from a given gene or it does not. It cannot be that only a part of 
> the read derives from one gene and the other part from some other gene.
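
The two quantities (a) and (b) discussed above can be sketched on a toy example. This is only an illustration under simplifying assumptions: reads and the feature are half-open intervals on a single chromosome, the coordinates are made up, and the function names are hypothetical. It ignores strand, spliced and paired-end reads, and the ambiguity resolution mentioned above.

```python
# Toy model: reads and a feature are half-open intervals [start, end).
# All coordinates are invented for illustration.

def overlap_bases(read, feature):
    """Number of bases shared by two half-open intervals (0 if disjoint)."""
    return max(0, min(read[1], feature[1]) - max(read[0], feature[0]))

def read_count(reads, feature):
    """(a) Number of reads that overlap the feature by at least one base."""
    return sum(1 for r in reads if overlap_bases(r, feature) > 0)

def coverage_sum(reads, feature):
    """(b) Number of feature bases overlapped, summed over all reads."""
    return sum(overlap_bases(r, feature) for r in reads)

feature = (100, 200)
reads = [
    (90, 140),   # hangs over the feature start, 40 bases inside
    (120, 170),  # fully inside, 50 bases
    (180, 230),  # hangs over the feature end, 20 bases inside
    (300, 350),  # no overlap
]

a = read_count(reads, feature)
b = coverage_sum(reads, feature)
print("reads overlapping the feature (a):", a)
print("summed coverage in bases (b):", b)
```

With 50-base reads, dividing (b) by the read length gives a fractional read count, because the two partially overlapping reads contribute less than one read each.
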
> 
> The advantage of (a) is that it counts "units of evidence". 
> Specifically, we know that the variance of a read count is at least as 
> large as the expected value of the count. This is because, conditioned 
> on the feature's actual concentration in the sample, read counts are 
> always Poisson distributed. Once you marginalize over the 
> within-sample-group distribution of concentration, you get some kind of 
> overdispersed Poisson, whose variance is strictly larger than the 
> expectation.
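
The variance argument above can be written out with the law of total variance. The symbols are assumptions introduced here for illustration: K is the read count of a feature, and q is the expected count implied by the feature's concentration in one sample, so that K given q is Poisson with mean q.

```latex
\begin{align*}
  \mathrm{E}[K] &= \mathrm{E}\bigl[\mathrm{E}[K \mid q]\bigr] = \mathrm{E}[q], \\
  \mathrm{Var}(K) &= \mathrm{E}\bigl[\mathrm{Var}(K \mid q)\bigr]
                   + \mathrm{Var}\bigl(\mathrm{E}[K \mid q]\bigr) \\
                  &= \mathrm{E}[q] + \mathrm{Var}(q) \\
                  &\ge \mathrm{E}[K].
\end{align*}
```

Equality holds only if Var(q) = 0, that is, if the concentration does not vary between samples of the group. Any sample-to-sample variation makes the variance strictly larger than the expectation.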
> 
> This gives you a lower bound on the variance for free, which is useful 
> to improve specificity of inferential methods. If you do not count reads 
> but something else, you do not get this automatic lower bound -- and 
> this is the actual reason why so many methods work on read counts rather 
> than coverage.
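
The lower bound can be checked with a small simulation. This is a sketch under stated assumptions, not the method of any particular package: the concentration is drawn from a gamma distribution (one common choice, which yields negative-binomial counts), the parameter values are arbitrary, and `floored_variance` is a hypothetical helper showing one way the bound could be used when only a few replicates are available.

```python
import math
import random
import statistics

def poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k

def floored_variance(counts):
    """Sample variance, but never below the sample mean (hypothetical helper)."""
    return max(statistics.variance(counts), statistics.mean(counts))

rng = random.Random(1)
n = 20000

# Same concentration in every sample: pure Poisson, variance close to the mean.
fixed = [poisson(20.0, rng) for _ in range(n)]

# Concentration varies between samples (assumed gamma, mean 5 * 4 = 20):
# overdispersed counts, variance clearly above the mean.
varying = [poisson(rng.gammavariate(5.0, 4.0), rng) for _ in range(n)]

ratio_fixed = statistics.variance(fixed) / statistics.mean(fixed)
ratio_varying = statistics.variance(varying) / statistics.mean(varying)
print("variance/mean, fixed concentration:  ", round(ratio_fixed, 2))
print("variance/mean, varying concentration:", round(ratio_varying, 2))

# Three replicates that happen to agree closely: the raw sample variance
# is far below the mean, so the floor takes over.
print("floored variance of [18, 20, 19]:", floored_variance([18, 20, 19]))
```

With only a handful of replicates, the raw sample variance can by chance fall far below the mean. Knowing that the true variance of a read count cannot be smaller than its expectation guards against treating such a feature as less variable than it can be.
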
> 
>    Simon
>


