[BioC] RNAseq less sensitive than microarrays? Is it a statistical issue?
Simon Anders
anders at embl.de
Tue May 21 21:47:36 CEST 2013
Hi Thomas
> Is it really so unreasonable to use this type of discrete
> raw expression values (sum of cov/feature) instead of read counts, and
> if so why?
First, the reverse question: What would be the advantage of using the
coverage per feature over reads? Once you have decided which reads to
use and where to map them, there is no real difference in difficulty
between (a) counting how many reads overlap with a given feature and (b)
adding up the numbers of bases of the feature that are overlapped by
each read.
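To make the two quantities concrete, here is a minimal sketch in
Python. It assumes reads and features are plain half-open (start, end)
intervals on one strand of one chromosome; the function names and the
toy coordinates are made up for illustration, and none of the hard
parts (overlapping features, paired ends, spliced alignments) are
handled.

```python
def overlap_bases(read, feature):
    """Number of bases shared by two half-open intervals (start, end)."""
    start = max(read[0], feature[0])
    end = min(read[1], feature[1])
    return max(0, end - start)


def read_count(reads, feature):
    """(a) Number of reads that overlap the feature by at least one base."""
    return sum(1 for r in reads if overlap_bases(r, feature) > 0)


def coverage_sum(reads, feature):
    """(b) Total number of feature bases covered, summed over all reads."""
    return sum(overlap_bases(r, feature) for r in reads)


# Hypothetical toy data: one feature and four 50-bp reads.
feature = (100, 200)
reads = [(90, 140), (120, 170), (180, 230), (300, 350)]

print(read_count(reads, feature))    # 3 reads touch the feature
print(coverage_sum(reads, feature))  # 40 + 50 + 20 = 110 bases
```

Both functions walk over the same reads and the same overlap test, so
neither is harder to compute than the other.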
Of course, some people might find (b) easier to do than (a) because they
happen to have a script for (b) lying around and not for (a), but it
could just as well be the other way round, because writing a script for
(a) is no more difficult than for (b). (And actually, neither is
trivial: detecting and resolving ambiguities is not as easy as it
sounds, especially if features overlap or if paired-end reads are
involved.)
BTW: I assume (b) is what you mean by coverage. If not, correct me.
The value (b) may sound slightly nicer as it counts reads only
fractionally if they overlap the feature only partially. I am not sure
whether this is really an advantage, though: Conceptually, a read either
stems from a given gene or it does not. It cannot be that only a part of
the read derives from one gene and the other part from some other gene.
The advantage of (a) is that it counts "units of evidence".
Specifically, we know that the variance of a read count is at least as
large as the expected value of the count. This is because, conditioned
on the feature's actual concentration in the sample, read counts are
always Poisson distributed. Once you marginalize over the
within-sample-group distribution of concentration, you get some kind of
overdispersed Poisson, whose variance is strictly larger than the
expectation.
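The argument can be written out with the law of total variance. The
symbols are my own notation, not anything standard: K is the read
count for the feature, q its concentration in the sample, s a scaling
factor for sequencing depth (taken as fixed here), and mu the expected
count.

```latex
K \mid q \sim \mathrm{Poisson}(s\,q), \qquad \mu = \mathrm{E}[K] = s\,\mathrm{E}[q]

\mathrm{Var}(K)
  = \mathrm{E}\big[\mathrm{Var}(K \mid q)\big] + \mathrm{Var}\big(\mathrm{E}[K \mid q]\big)
  = s\,\mathrm{E}[q] + s^{2}\,\mathrm{Var}(q)
  = \mu + s^{2}\,\mathrm{Var}(q) \;\ge\; \mu
```

The inequality is strict whenever the concentration actually varies
within the sample group, i.e. whenever Var(q) > 0.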
This gives you for free a lower bound on the variance, which is useful
to improve specificity of inferential methods. If you do not count reads
but something else, you do not get this automatic lower bound -- and
this is the actual reason why so many methods work on read counts rather
than coverage.
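As a rough illustration of how such a bound can be used, here is a
small Python sketch. It is not how any particular package does it: the
function name is hypothetical, and it assumes the replicates have equal
sequencing depth, so that no normalization is needed. The idea is only
that a sample variance estimated from very few replicates is never
allowed to fall below the mean.

```python
import statistics


def floored_variance(counts):
    """Sample variance of replicate read counts, but never below the mean.

    Assumes equal sequencing depth across replicates (no size factors).
    """
    mean = float(statistics.mean(counts))
    sample_var = float(statistics.variance(counts))
    return max(sample_var, mean)


# Three replicates that happen to agree very closely: the raw sample
# variance is 1, far below what Poisson noise alone would produce.
replicates = [100, 101, 99]
print(statistics.variance(replicates))  # 1
print(floored_variance(replicates))     # 100.0
```

Without the floor, a test would treat this gene as having almost no
variability and would call tiny differences significant.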
Simon