[BioC] [Bioc] RNAseq less sensitive than microarrays? Is it a statistical issue?
Thomas Girke
thomas.girke at ucr.edu
Tue May 21 22:16:56 CEST 2013
Thanks. Your explanation makes sense. I really had to bring this up
(perhaps I should have started a new email thread) since it appears to be
such a basic question, and I didn't have a convincing answer to it.
Thanks to Simon and others for taking the time to respond to this almost
"philosophical" question: "the meaning of read counting" :). I appreciate
it.
Thomas
On Tue, May 21, 2013 at 07:47:36PM +0000, Simon Anders wrote:
> Hi Thomas
>
> > Is it really so unreasonable to use this type of discrete
> > raw expression values (sum of cov/feature) instead of read counts, and
> > if so why?
>
> First, the reverse question: What would be the advantage of using the
> coverage per feature over reads? Once you have decided which reads to
> use and where to map them, there is no real difference in difficulty
> between (a) counting how many reads overlap with a given feature and (b)
> adding up the numbers of bases of the feature that are overlapped by
> each read.
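To make the two quantities concrete, here is a minimal Python sketch of (a) and (b). It assumes reads and features are given as half-open (start, end) intervals on the same strand and chromosome; the function names and the toy coordinates are made up for illustration and do not come from any particular package.

```python
def overlap_bases(read, feature):
    """Number of bases shared by two half-open intervals (start, end)."""
    return max(0, min(read[1], feature[1]) - max(read[0], feature[0]))


def count_reads(reads, feature):
    """(a) Number of reads that overlap the feature by at least one base."""
    return sum(1 for r in reads if overlap_bases(r, feature) > 0)


def summed_coverage(reads, feature):
    """(b) Number of feature bases overlapped by each read, summed over reads."""
    return sum(overlap_bases(r, feature) for r in reads)


# Hypothetical toy data: one feature and four reads.
feature = (100, 200)
reads = [(90, 140), (120, 170), (180, 230), (300, 350)]

print(count_reads(reads, feature))      # 3
print(summed_coverage(reads, feature))  # 110
```

The sketch deliberately ignores the hard parts mentioned in this thread (overlapping features, paired-end reads, ambiguous assignments); it only shows that (a) and (b) are equally easy once the read-to-feature overlaps are known.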
>
> Of course, some people might find (b) easier to do than (a) because they
> happen to have a script for (b) lying around and not for (a), but it
> could just as well be the other way round, because writing a script is no
> more difficult for (a) than for (b). (And actually, neither is trivial:
> detecting and resolving ambiguities is not as easy as it sounds,
> especially if features overlap or if paired-end reads are involved.)
>
> BTW: I assume (b) is what you mean by coverage. If not, correct me.
>
> The value (b) may sound slightly nicer as it counts reads only
> fractionally if they overlap the feature only partially. I am not sure
> whether this is really an advantage, though: Conceptually, a read either
> stems from a given gene or it does not. It cannot be that only a part of
> the read derives from one gene and the other part from some other gene.
>
> The advantage of (a) is that it counts "units of evidence".
> Specifically, we know that the variance of a read count is at least as
> large as the expected value of the count. This is because, conditioned
> on the feature's actual concentration in the sample, read counts are
> always Poisson distributed. Once you marginalize over the
> within-sample-group distribution of concentration, you get some kind of
> overdispersed Poisson, whose variance is strictly larger than the
> expectation.
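The variance argument can be checked numerically. The following Python sketch assumes, purely for illustration, that the concentration within a sample group takes one of three Poisson rates (50, 100, 150) with equal probability; these numbers and the function names are hypothetical. It computes the marginal mean and variance of the count by summing the mixture's probability mass function, and compares the result with the law of total variance, Var(K) = E[lambda] + Var(lambda).

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability P(K = k) for rate lam, computed in log space."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def mixture_moments(rates, weights, k_max=1000):
    """Mean and variance of a Poisson mixture, by summing the pmf up to k_max."""
    mean = 0.0
    second_moment = 0.0
    for k in range(k_max + 1):
        p = sum(w * poisson_pmf(k, lam) for lam, w in zip(rates, weights))
        mean += k * p
        second_moment += k * k * p
    return mean, second_moment - mean * mean


# Fixed concentration: plain Poisson, so variance equals mean.
print(mixture_moments([100.0], [1.0]))

# Concentration varies within the group (assumed values): overdispersed Poisson.
rates = [50.0, 100.0, 150.0]
weights = [1 / 3, 1 / 3, 1 / 3]
mean, var = mixture_moments(rates, weights)

# Law of total variance: Var(K) = E[lambda] + Var(lambda).
e_lam = sum(w * lam for lam, w in zip(rates, weights))
var_lam = sum(w * (lam - e_lam) ** 2 for lam, w in zip(rates, weights))
print(mean, var, e_lam + var_lam)
```

With these assumed rates the marginal mean is 100 while the variance is about 1766.7, so the variance exceeds the mean by exactly the variance of the concentration. The excess is zero only when the concentration does not vary at all.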
>
> This gives you for free a lower bound on the variance, which is useful
> to improve specificity of inferential methods. If you do not count reads
> but something else, you do not get this automatic lower bound -- and
> this is the actual reason why so many methods work on read counts rather
> than coverage.
>
> Simon
>