[BioC] dexseq zero length exons

Mon Mar 10 15:57:42 CET 2014

Dear Jose,

Thanks a lot for your detailed report!

The cases that you mention are "normal" cases that are originated from 
our exon bining, these are instances where the annotation file contains 
cases like the one you just describe (having a transcript initiation 
site or alternative splice site with single base pair differences). 
These are exon bins covering a single base pair (the length of the exon 
bins is fData(ecs)$end - fData(ecs)$start + 1).

The reason you can get counts and can achieve significance for DEU for 
small exon bins is because the reads mapping to multiple exon bins are 
counted in all the exon bins, so these are cases where the read maps to 
this small exon bins and many other regions of the genome.  I think 
having differences in these exons can still biologically informative 
despite its length, but it always depends on the analysis you are 
doing.  Say, if you are interested in large cassette skipping events you 
could get rid of this small bins and reduce the number of tests.

Best regards,
Alejandro

> Dear Alejandro,
> I have just analysed my results using DEXSeq 1.9.
> I had prepared the gff annotation using the version of 
> dexseq_prepare.annotation.py <http://dexseq_prepare.annotation.py> in 
> DEXSeq 1.9 that contained the -r parameter but counted with an old 
> version, as you advised to me, of the dexseq_count.py since the newest 
> included a check for a NH tag, which my bam files (SOAPSplice) did not 
> contain. Everything worked fine but when checking some of the exons 
> I realised that there are some, and some amongst the DEU results too, 
> that showed length equal zero. I have checked and there are a number 
> of those amongst the GFF. I have checked in the original gtf file I 
> got from Ensembl (v72 hg19) for one of those genes and I do not 
> understand why there should be one of length 0.
> Here you are a histogram of all exons showing very short exons and 0 
> length ones and an example of the gtf and gff file prepared by 
> dexseq_prepare_annotation.py.
> Is it normal? How can there be exons with the same start and end in 
> the gff file? How can they have reads assigned and even pass the tests?
>
> # exons length
> exon.length<-fData(ecs)$end - fData(ecs)$start
> hist(log2(exon.length),breaks=100)
>
> The gff file 'grepped' for one example:
>
> $ grep 'ENSG00000001460' Homo_sapiens.ensembl72.DEXSeq.gff
>
> chr1dexseq_prepare_annotation.pyaggregate_gene2468348924743424.-.gene_id 
> "ENSG00000001460"
>
> *chr1dexseq_prepare_annotation.pyexonic_part_24683489__24683489_.-.transcripts 
> "ENST00000374409"; exonic_part_number "001"; gene_id "ENSG00000001460"*
>
> chr1dexseq_prepare_annotation.pyexonic_part2468349024683494.-.transcripts 
> "ENST00000440416+ENST00000374409"; exonic_part_number "002"; gene_id 
> "ENSG00000001460"
>
> chr1dexseq_prepare_annotation.pyexonic_part2468349524683526.-.transcripts 
> "ENST00000468303+ENST00000440416+ENST00000003583+ENST00000374409"; 
> exonic_part_number "003"; gene_id "ENSG00000001460"
>
> chr1dexseq_prepare_annotation.pyexonic_part2468352724685109.-.transcripts 
> "ENST00000468303+ENST00000440416+ENST00000374409+ENST00000003583+ENST00000337248"; 
> exonic_part_number "004"; gene_id "ENSG00000001460"
>
> chr1dexseq_prepare_annotation.pyexonic_part2468734124687531.-.transcripts 
> "ENST00000468303+ENST00000440416+ENST00000374409+ENST00000003583+ENST00000337248"; 
> exonic_part_number "005"; gene_id "ENSG00000001460"
>
> chr1dexseq_prepare_annotation.pyexonic_part2469519424695532.-.transcripts 
> "ENST00000497384"; exonic_part_number "006"; gene_id "ENSG00000001460"
>
> Now the gtf used as input for dexseq_prepare_annotation.py for that 
> gene searching for the start and end coordinate:
>
> grep 'ENSG00000001460' hg19.ensGene_withGeneName.gtf | grep '24683489'
>
> chr1hg19_ensGeneexon*2468348924685109*0.000000-.gene_id 
> "ENSG00000001460"; transcript_id "*ENST00000374409*"; gene_name "STPG1";
>
> It seems that the gtf is OK and there is an exon that starts at that 
> position but nothing at the end.
>
> I checked also the ecs object coming from an older analysis of the 
> same dataset using the two py scripts from the same release that gave 
> me a good results but I wanted to repeat to include the -r parameter, 
> and it also contains a lot of these zero length exons.
>
> Is it normal?
>
> I have further checked within the Genome Browser and it seems that 
> there is another Transcript with start at +1 with respect to that.
>
> If this is so it means that zero-length in the end-start difference 
> from the gff file that I found in the fData of the ecs object is 
> indeed 1bp length and just a matter of 0/1 conventions. If this is so, 
> I would be more relieved, but please confirm.
>
> In anycase, may I ask how reliable counts of ~75bp read can be to an 
> exon made from 1 pb difference in the start of two transcripts in the 
> annotation? how can they even pass the filter of p-value for DEU 
> analysis? Shouldn't it be a sort of cut off for this 'pseudo'exons?
>
>
> Thanks again for your help
>
> Best
>
> Jose
>
>
>
>
>
>
>
>
>
> -- 
> Jose M. Garcia Manteiga PhD
> Research Assistant - Data Analysis in Functional Genomics
> Center for Translational Genomics and BioInformatics
> Dibit2-Basilica, 3A3
> San Raffaele Scientific Institute
> Via Olgettina 58, 20132 Milano (MI), Italy
>
> Tel: +39-02-2643-4751
> e-mail: garciamanteiga.josemanuel at hsr.it 
> <mailto:garciamanteiga.josemanuel at hsr.it>