[BioC] HT-seq counting - gene vs isoform

Wed Dec 5 11:35:50 CET 2012

Hi

On 04/12/12 23:46, Akula, Nirmala (NIH/NIMH) [C] wrote:
> When counting at gene level, I assume the reads that fall on all exons (Exon1, Exon2, Exon3 and Exon4) are all summed up for GeneA.
>
> When counting at isoform level,
>
> GeneA_isoform1 - is it sum of exons from Exon1, Exon2 and Exon3 (or) just reads that map to Exon2?
> GeneA_isoform2 - is it sum of Exon1 and Exon3 (or) no counts because its exons are common with isoform1 and isoform3?
> GeneA_isoform3 - sum of Exon1 and Exon4 (or) only Exon4?

Always the latter. This is why htseq-count is not suitable to count at 
isoform level.

To explain the rationale behind this:

HTSeq-count is meant to be used for differential expression analysis; 
hence the rule that ambiguous mappings are discarded. Consider two genes 
that share part of their sequence, one of them being differentially 
expressed, the other not. If we count reads mapping to the shared part 
(and hence to both genes), we will wrongly conclude that they are _both_ 
differentially expressed. If we discard the reads mapping to the shared 
part, we underestimate both genes' expression but we do so by the same 
fraction in all samples so that any inference about expression changes 
is still correct.

For counting at gene level, we can afford to discard the rather few 
reads that map to shared sequence. (With long reads, there is few such 
stretches longer than the read length even between paralogs.) For 
isoforms, this becomes untenable, and hence, any attempt of inferring 
differential expression at the isoform level is bound to fail if it is 
based on simple counting.

Instead, one should either use some method based on Bayesian inference 
(e.g. BitSeq) or perform the inference on the exon level (our DEXSeq 
approach). See our paper for a discussion why the prefer the latter and 
see Glaus et al.'s paper to learn more about the former.

   Simon