[BioC] [Bioc] RNAseq less sensitive than microarrays? Is it a statistical issue?
Simon Anders
anders at embl.de
Tue May 21 21:01:39 CEST 2013
Hi Thomas
On 21/05/13 17:49, Thomas Girke wrote:
> Because of these complications, I am sometimes wondering whether one
> couldn't use for many RNA-Seq use cases coverage values (e.g. mean
> coverage) as raw expression measure instead of read counts. Has anyone
> systematically tested whether this would be a suitable approach for the
> downstream DEG analysis? Right now everyone believes RNA-Seq analysis
> requires read counting, but honestly I don't see why that is. Perhaps
> the benefits of this are so minor that it is not worth dealing with a
> different type of expression measure.
The "complications" I had in mind apply to mean coverage as well as to
reads.
Actually, at least in my personal opinion, it is quite clear which rules
one should stick to when obtaining the count table, and therefore,
obtaining a correct count table is not at all complicated if one thinks
carefully enough about it.
The main issue is this: Imagine, a number of reads can be mapped to two
distinct genes, either because the genes overlap, or, more commonly,
because the genes share repetitive sequence. If you count your reads for
both genes then both genes will appear differentially expressed even if
only one gene actually is. Hence, one must discard reads that cannot be
unambiguously assigned to one gene. Of course, genes which lose reads in
this manner will have to low expression estimates, i.e., you incur a
negative bias. However, this bias cancels out when comparing the same
gene across samples, i.e., it is not a reason for concern in
differential expression testing.
Consequently, a method which aims at getting _unbiased_ point estimates
of expression strength is typically unsuitable as input for strength for
differential expression testing. The point that counting for expression
estimation and for DE testing are different tasks is subtle and often
overlooked. Mistakes arising from this can cause strange effects.
A particularly common mistake is to map the reads to transcripts, not
genes, or to count overlaps not with annotated genes but with annotated
transcripts. Of course, most reads will map to several transcripts
because most transcript isoforms of a given gene overlap heavily. If one
counts reads mapping to multiple transcripts for each of these, one gets
severe artifacts in downstream analysis, and if one discards multiply
mapping reads one loses most of the genes.
I do not know what the original poster meant when she said that she
counted for UCSC transcripts (rather than genes) but if she took care to
avoid the common pitfalls just described, she must have employed a quite
sophisticated construction. This is why my first question was how she
obtained the counts.
> Right now everyone believes RNA-Seq analysis
> requires read counting, but honestly I don't see why that is.
Just to clarify this: I don't think that this is universally believed.
It is only that a method is typically designed for a specific type of
data, and several commonly used Bioconductor packages for RNA-Seq
analysis, namely edgeR, DESeq, BaySeq, DSS, expect count data, because
the specific statistical methods used in these packages are build on the
assumption that they get read counts. Obviously, they will not give
correct results if they are given any other kind of data. This is stated
quite clearly in the vignettes, but nevertheless people keep asking
whether they can supply arbitrary other kind of data such as rounded
coverage values. And this insistence in using methods in a manner
clearly violating instructions is, quite frankly, a bit frustrating. So
when you observed that people like me seem to be quite insistent on the
importance of using counts this may have been with respect to these very
common question, which are of course specific to certain methods.
Simon
More information about the Bioconductor
mailing list