[BioC] [Bioc] RNAseq less sensitive than microarrays? Is it a statistical issue?

Yuan Hao yuan.x.hao at gmail.com
Tue May 21 22:20:40 CEST 2013


Hi Simon,

You made the point very clearly. This makes me think about something related that I am not quite sure about myself. In practice, we sometimes split a read that maps to multiple features into fractions, especially in the case of transposons: if a read maps to m different features, we count 1/m for each of those features. This somewhat breaks your 'units of evidence' rule. If we would still like to preserve the advantage of a smaller variance, do you think it is reasonable to always normalize the counts based on unique mappers, even for counts that originate from multi-mappers?
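
To make the 1/m scheme concrete, here is a minimal sketch, assuming each read comes with the list of distinct features it maps to. The function name `fractional_counts` and the feature labels are made up for illustration and do not come from any particular tool.

```python
from collections import defaultdict
from fractions import Fraction


def fractional_counts(read_hits):
    """Split each read equally among the features it maps to.

    read_hits: one list per read, naming the distinct features
    that read maps to (hypothetical input format).
    """
    counts = defaultdict(Fraction)
    for hits in read_hits:
        m = len(hits)
        if m == 0:
            continue  # an unmapped read contributes nothing
        for feature in hits:
            counts[feature] += Fraction(1, m)  # 1/m per feature
    return dict(counts)


# Four reads: one unique to TE1, two multi-mappers, one unique to geneA.
reads = [["TE1"], ["TE1", "TE2"], ["TE1", "TE2", "TE3"], ["geneA"]]
counts = fractional_counts(reads)
for feature, value in sorted(counts.items()):
    print(feature, value)
```

The fractions still add up to the number of mapped reads, but the per-feature values are no longer integer counts. For a Poisson count K with mean lambda, K/m has mean lambda/m and variance lambda/m^2, so for m > 1 the variance is smaller than the mean.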

Cheers,
Yuan


On May 21, 2013, at 3:47 PM, Simon Anders <anders at embl.de> wrote:

> Hi Thomas
> 
>> Is it really so unreasonable to use this type of discrete
>> raw expression values (sum of cov/feature) instead of read counts, and
>> if so why?
> 
> First, the reverse question: What would be the advantage of using the coverage per feature over reads? Once you have decided which reads to use and where to map them, there is no real difference in difficulty between (a) counting how many reads overlap with a given feature and (b) adding up the numbers of bases of the feature that are overlapped by each read.
> 
> Of course, some people might find (b) easier to do than (a) because they happen to have a script for (b) lying around and not for (a), but it could just as well be the other way round, because writing a script is no more difficult for (a) than for (b). (And actually, neither is trivial: detecting and resolving ambiguities is not as easy as it sounds, especially if features overlap or if paired-end reads are involved.)
> 
> BTW: I assume (b) is what you mean by coverage. If not, correct me.
> 
> The value (b) may sound slightly nicer as it counts reads only fractionally if they overlap the feature only partially. I am not sure whether this is really an advantage, though: Conceptually, a read either stems from a given gene or it does not. It cannot be that only a part of the read derives from a gene, and the other part from some other gene.
> 
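
A minimal sketch of the difference between (a) and (b), assuming reads and the feature are half-open intervals (start, end) on the same chromosome and strand. The function names and coordinates are hypothetical, and the ambiguity handling mentioned above (overlapping features, paired-end reads) is left out entirely.

```python
def count_reads(feature, reads):
    """(a) Number of reads that overlap the feature by at least one base."""
    f_start, f_end = feature
    return sum(1 for r_start, r_end in reads
               if r_start < f_end and r_end > f_start)


def summed_coverage(feature, reads):
    """(b) Total number of feature bases covered, summed over all reads."""
    f_start, f_end = feature
    total = 0
    for r_start, r_end in reads:
        overlap = min(r_end, f_end) - max(r_start, f_start)
        if overlap > 0:  # a negative or zero value means no overlap
            total += overlap
    return total


feature = (100, 200)
# Four reads of 50 bases each; the last one misses the feature.
reads = [(90, 140), (120, 170), (180, 230), (300, 350)]

n_reads = count_reads(feature, reads)         # 3
n_bases = summed_coverage(feature, reads)     # 40 + 50 + 20 = 110
print(n_reads, n_bases, n_bases / 50)         # 3 110 2.2
```

With (b), the two partially overlapping reads contribute only 40 and 20 of their 50 bases, so the feature ends up with the equivalent of 2.2 reads instead of 3.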
> The advantage of (a) is that it counts "units of evidence". Specifically, we know that the variance of a read count is at least as large as the expected value of the count. This is because, conditioned on the feature's actual concentration in the sample, read counts are always Poisson distributed. Once you marginalize over the within-sample-group distribution of concentration, you get some kind of overdispersed Poisson, whose variance is strictly larger than the expectation.
> 
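
The variance argument can be written out with the law of total variance. This is a sketch under assumed notation that is not in the original message: K is the read count of a feature, q its concentration in a sample, and s a sample-specific scaling factor, so that K given q is Poisson with mean sq.

```latex
\begin{align*}
  K \mid q &\sim \mathrm{Poisson}(s\,q), \qquad \mu := \mathrm{E}[K] = s\,\mathrm{E}[q], \\
  \mathrm{Var}(K) &= \mathrm{E}\bigl[\mathrm{Var}(K \mid q)\bigr] + \mathrm{Var}\bigl(\mathrm{E}[K \mid q]\bigr) \\
                  &= s\,\mathrm{E}[q] + s^{2}\,\mathrm{Var}(q) \\
                  &= \mu + s^{2}\,\mathrm{Var}(q) \;\ge\; \mu .
\end{align*}
```

The inequality is strict whenever q actually varies within the sample group, i.e. Var(q) > 0.
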
> This gives you for free a lower bound on the variance, which is useful to improve specificity of inferential methods. If you do not count reads but something else, you do not get this automatic lower bound -- and this is the actual reason why so many methods work on read counts rather than coverage.
> 
>  Simon
> 


