[BioC] HT-seq counting - gene vs isoform
Simon Anders
anders at embl.de
Wed Dec 5 11:35:50 CET 2012
Hi
On 04/12/12 23:46, Akula, Nirmala (NIH/NIMH) [C] wrote:
> When counting at gene level, I assume the reads that fall on all exons (Exon1, Exon2, Exon3 and Exon4) are all summed up for GeneA.
>
> When counting at isoform level,
>
> GeneA_isoform1 - is it sum of exons from Exon1, Exon2 and Exon3 (or) just reads that map to Exon2?
> GeneA_isoform2 - is it sum of Exon1 and Exon3 (or) no counts because its exons are common with isoform1 and isoform3?
> GeneA_isoform3 - sum of Exon1 and Exon4 (or) only Exon4?
Always the latter. This is why htseq-count is not suitable to count at
isoform level.
To explain the rationale behind this:
HTSeq-count is meant to be used for differential expression analysis;
hence the rule that ambiguous mappings are discarded. Consider two genes
that share part of their sequence, one of them being differentially
expressed, the other not. If we count reads mapping to the shared part
(and hence to both genes), we will wrongly conclude that they are _both_
differentially expressed. If we discard the reads mapping to the shared
part, we underestimate both genes' expression but we do so by the same
fraction in all samples so that any inference about expression changes
is still correct.
For counting at gene level, we can afford to discard the rather few
reads that map to shared sequence. (With long reads, there is few such
stretches longer than the read length even between paralogs.) For
isoforms, this becomes untenable, and hence, any attempt of inferring
differential expression at the isoform level is bound to fail if it is
based on simple counting.
Instead, one should either use some method based on Bayesian inference
(e.g. BitSeq) or perform the inference on the exon level (our DEXSeq
approach). See our paper for a discussion why the prefer the latter and
see Glaus et al.'s paper to learn more about the former.
Simon
More information about the Bioconductor
mailing list