[BioC] [Bioc] RNAseq less sensitive than microarrays? Is it a statistical issue?

Thomas Girke thomas.girke at ucr.edu
Tue May 21 21:30:25 CEST 2013


Hi Simon,

I totally agree with your statements on the importance of generating
appropriate read counting tables. In my email I should have used as an
example "sum of coverage" per feature rather than mean coverage since I
don't want to direct the discussion into an area that has been discussed
many times. Is it really so unreasonable to use this type of discrete
raw expression values (sum of cov/feature) instead of read counts, and
if so why?

Thomas


On Tue, May 21, 2013 at 07:01:39PM +0000, Simon Anders wrote:
> Hi Thomas
> 
> On 21/05/13 17:49, Thomas Girke wrote:
> > Because of these complications, I am sometimes wondering whether one
> > couldn't use for many RNA-Seq use cases coverage values (e.g. mean
> > coverage) as raw expression measure instead of read counts. Has anyone
> > systematically tested whether this would be a suitable approach for the
> > downstream DEG analysis? Right now everyone believes RNA-Seq analysis
> > requires read counting, but honestly I don't see why that is. Perhaps
> > the benefits of this are so minor that it is not worth dealing with a
> > different type of expression measure.
> 
> The "complications" I had in mind apply to mean coverage as well as to 
> reads.
> 
> Actually, at least in my personal opinion, it is quite clear which rules 
> one should stick to when obtaining the count table, and therefore, 
> obtaining a correct count table is not at all complicated if one thinks 
> carefully enough about it.
> 
> The main issue is this: Imagine, a number of reads can be mapped to two 
> distinct genes, either because the genes overlap, or, more commonly, 
> because the genes share repetitive sequence. If you count your reads for 
> both genes then both genes will appear differentially expressed even if 
> only one gene actually is. Hence, one must discard reads that  cannot be 
> unambiguously assigned to one gene. Of course, genes which lose reads in 
> this manner will have to low expression estimates, i.e., you incur a 
> negative bias. However, this bias cancels out when comparing the same 
> gene across samples, i.e., it is not a reason for concern in 
> differential expression testing.
> 
> Consequently, a method which aims at getting _unbiased_ point estimates 
> of expression strength is typically unsuitable as input for strength for 
> differential expression testing. The point that counting for expression 
> estimation and for DE testing are different tasks is subtle and often 
> overlooked. Mistakes arising from this can cause strange effects.
> 
> A particularly common mistake is to map the reads to transcripts, not 
> genes, or to count overlaps not with annotated genes but with annotated 
> transcripts. Of course, most reads will map to several transcripts 
> because most transcript isoforms of a given gene overlap heavily. If one 
> counts reads mapping to multiple transcripts for each of these, one gets 
> severe artifacts in downstream analysis, and if one discards multiply 
> mapping reads one loses most of the genes.
> 
> I do not know what the original poster meant when she said that she 
> counted for UCSC transcripts (rather than genes) but if she took care to 
> avoid the common pitfalls just described, she must have employed a quite 
> sophisticated construction. This is why my first question was how she 
> obtained the counts.
> 
>  > Right now everyone believes RNA-Seq analysis
>  > requires read counting, but honestly I don't see why that is.
> 
> Just to clarify this: I don't think that this is universally believed. 
> It is only that a method is typically designed for a specific type of 
> data, and several commonly used Bioconductor packages for RNA-Seq 
> analysis, namely edgeR, DESeq, BaySeq, DSS, expect count data, because 
> the specific statistical methods used in these packages are build on the 
> assumption that they get read counts. Obviously, they will not give 
> correct results if they are given any other kind of data. This is stated 
> quite clearly in the vignettes, but nevertheless people keep asking 
> whether they can supply arbitrary other kind of data such as rounded 
> coverage values. And this insistence in using methods in a manner 
> clearly violating instructions is, quite frankly, a bit frustrating. So 
> when you observed that people like me seem to be quite insistent on the 
> importance of using counts this may have been with respect to these very 
> common question, which are of course specific to certain methods.
> 
>    Simon



More information about the Bioconductor mailing list