[BioC] DEXSeq - too many exons in gene
António Domingues
amjdomingues at gmail.com
Thu Feb 6 18:01:10 CET 2014
Hi Bioconductors,
I came across something odd in DEXSeq: a gene that appears to have
more exons in the final DEXSeq output than the annotation suggests. For
gene ENSMUSG00000027854, the UCSC view (screenshot attached) suggests
3 exons in a flattened gene model. However, the DEXSeq results list
13 exons (shown here in the output of htseq-count):
grep ENSMUSG00000027854 htseq_count_out.txt
ENSMUSG00000027854:001 0
ENSMUSG00000027854:002 6
ENSMUSG00000027854:003 18
ENSMUSG00000027854:004 0
ENSMUSG00000027854:005 0
ENSMUSG00000027854:006 86
ENSMUSG00000027854:007 0
ENSMUSG00000027854:008 113
ENSMUSG00000027854:009 52
ENSMUSG00000027854:010 76
ENSMUSG00000027854:011 0
ENSMUSG00000027854:012 310
ENSMUSG00000027854:013 554
This comes from the annotation created with:
dexseq_prepare_annotation.py mm10_ensGene.gtf mm10_ensGene.gff
grep ENSMUSG00000027854 ../../data/gtf/mm10_ensGene.gff
chr3 mm10_ensGene.gtf aggregate_gene 102995728 103003914 . + . gene_id "ENSMUSG00000027854"
chr3 mm10_ensGene.gtf exonic_part 102995728 102995729 . + . transcripts "ENSMUST00000029447"; exonic_part_number "001"; gene_id "ENSMUSG00000027854"
chr3 mm10_ensGene.gtf exonic_part 102995730 102995794 . + . transcripts "ENSMUST00000029447+ENSMUST00000151065"; exonic_part_number "002"; gene_id "ENSMUSG00000027854"
chr3 mm10_ensGene.gtf exonic_part 102995795 102995967 . + . transcripts "ENSMUST00000151065+ENSMUST00000029447+ENSMUST00000119450"; exonic_part_number "003"; gene_id "ENSMUSG00000027854"
chr3 mm10_ensGene.gtf exonic_part 102995968 102996048 . + . transcripts "ENSMUST00000151065"; exonic_part_number "004"; gene_id "ENSMUSG00000027854"
chr3 mm10_ensGene.gtf exonic_part 102996049 102996155 . + . transcripts "ENSMUST00000151065+ENSMUST00000137332"; exonic_part_number "005"; gene_id "ENSMUSG00000027854"
chr3 mm10_ensGene.gtf exonic_part 102996156 102996261 . + . transcripts "ENSMUST00000029447+ENSMUST00000137332+ENSMUST00000151065"; exonic_part_number "006"; gene_id "ENSMUSG00000027854"
chr3 mm10_ensGene.gtf exonic_part 102996262 102997242 . + . transcripts "ENSMUST00000151065"; exonic_part_number "007"; gene_id "ENSMUSG00000027854"
chr3 mm10_ensGene.gtf exonic_part 102997243 102997351 . + . transcripts "ENSMUST00000029447+ENSMUST00000137332+ENSMUST00000151065"; exonic_part_number "008"; gene_id "ENSMUSG00000027854"
chr3 mm10_ensGene.gtf exonic_part 102997352 102997385 . + . transcripts "ENSMUST00000029447+ENSMUST00000151065"; exonic_part_number "009"; gene_id "ENSMUSG00000027854"
chr3 mm10_ensGene.gtf exonic_part 102998490 102998603 . + . transcripts "ENSMUST00000151065+ENSMUST00000029447+ENSMUST00000119450"; exonic_part_number "010"; gene_id "ENSMUSG00000027854"
chr3 mm10_ensGene.gtf exonic_part 102998604 102999251 . + . transcripts "ENSMUST00000151065"; exonic_part_number "011"; gene_id "ENSMUSG00000027854"
chr3 mm10_ensGene.gtf exonic_part 103001708 103002194 . + . transcripts "ENSMUST00000029447+ENSMUST00000119450"; exonic_part_number "012"; gene_id "ENSMUSG00000027854"
chr3 mm10_ensGene.gtf exonic_part 103002195 103003914 . + . transcripts "ENSMUST00000029447"; exonic_part_number "013"; gene_id "ENSMUSG00000027854"
Exon 1 is only 2 bases long (?) and exons 1 to 4 are contiguous. As
far as I am aware, DEXSeq should have flattened all of these into
one single "exon". Is this correct? Is the error coming from the GTF?
(The gene's annotation in the GTF is at the end of this message.)
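To make the length and contiguity observations easy to check, here is a minimal Python sketch. It is not part of DEXSeq; the names `parts`, `part_lengths` and `contiguous_blocks` are my own, and the coordinates are copied from the GFF lines above, assuming the usual 1-based, inclusive GFF convention. It computes the length of each exonic part and merges parts that abut each other into blocks.

```python
# Exonic parts of ENSMUSG00000027854 as (part number, start, end),
# copied from the GFF lines above; coordinates are 1-based and inclusive.
parts = [
    ("001", 102995728, 102995729),
    ("002", 102995730, 102995794),
    ("003", 102995795, 102995967),
    ("004", 102995968, 102996048),
    ("005", 102996049, 102996155),
    ("006", 102996156, 102996261),
    ("007", 102996262, 102997242),
    ("008", 102997243, 102997351),
    ("009", 102997352, 102997385),
    ("010", 102998490, 102998603),
    ("011", 102998604, 102999251),
    ("012", 103001708, 103002194),
    ("013", 103002195, 103003914),
]


def part_lengths(parts):
    """Length in bases of each part (inclusive coordinates, hence the + 1)."""
    return {num: end - start + 1 for num, start, end in parts}


def contiguous_blocks(parts):
    """Merge parts that abut, i.e. next start == previous end + 1."""
    blocks = []
    for num, start, end in sorted(parts, key=lambda p: p[1]):
        if blocks and start == blocks[-1]["end"] + 1:
            blocks[-1]["parts"].append(num)
            blocks[-1]["end"] = end
        else:
            blocks.append({"start": start, "end": end, "parts": [num]})
    return blocks


for num, length in part_lengths(parts).items():
    print(num, length)
for block in contiguous_blocks(parts):
    print(block["start"], block["end"], ",".join(block["parts"]))
```

With these coordinates the 13 parts merge into three blocks, which would be consistent with the 3 exons seen in the UCSC view.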
This is especially concerning for me because I am interested in selecting
the first and last exon of each gene, using the exon ranking from DEXSeq,
for further analysis.
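For illustration, the selection I have in mind could be sketched in Python roughly as follows. This is only a sketch under assumptions: `first_last_parts` is a hypothetical helper of mine, not a DEXSeq function, and because I am not sure how `exonic_part_number` is ordered for minus-strand genes, it sorts by coordinate and uses the strand column explicitly rather than relying on the numbering.

```python
import re
from collections import defaultdict

# Matches GFF attributes of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def first_last_parts(gff_lines):
    """Return {gene_id: (first_part, last_part)} in transcription order.

    Each part is a (start, end, exonic_part_number) tuple.
    """
    by_gene = defaultdict(list)
    strand_of = {}
    for line in gff_lines:
        # Columns may be separated by tabs or spaces; the 9th holds the attributes.
        fields = line.rstrip("\n").split(None, 8)
        if len(fields) < 9 or fields[2] != "exonic_part":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        gene = attrs["gene_id"]
        strand_of[gene] = fields[6]
        by_gene[gene].append(
            (int(fields[3]), int(fields[4]), attrs["exonic_part_number"])
        )
    result = {}
    for gene, gene_parts in by_gene.items():
        gene_parts.sort()  # ascending genomic start
        first, last = gene_parts[0], gene_parts[-1]
        if strand_of[gene] == "-":
            # On the minus strand, transcription starts at the highest coordinate.
            first, last = last, first
        result[gene] = (first, last)
    return result


# Three of the GFF lines shown above, used as a small demo input.
gff_lines = [
    'chr3 mm10_ensGene.gtf aggregate_gene 102995728 103003914 . + . gene_id "ENSMUSG00000027854"',
    'chr3 mm10_ensGene.gtf exonic_part 102995728 102995729 . + . transcripts "ENSMUST00000029447"; exonic_part_number "001"; gene_id "ENSMUSG00000027854"',
    'chr3 mm10_ensGene.gtf exonic_part 103002195 103003914 . + . transcripts "ENSMUST00000029447"; exonic_part_number "013"; gene_id "ENSMUSG00000027854"',
]
print(first_last_parts(gff_lines))
```

With the flattened annotation as it stands, the "first exon" picked this way would be part 001, which is why the short part at the start of the gene matters for my analysis.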
Thanks,
António
> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] C
attached base packages:
[1] grDevices datasets stats graphics utils methods base
other attached packages:
[1] DEXSeq_1.4.0 GenomicFeatures_1.10.2 GenomicRanges_1.10.5
[4] IRanges_1.16.6 data.table_1.8.9 stringr_0.6.2
[7] ggplot2_0.9.3.1 AnnotationDbi_1.20.2 Biobase_2.18.0
[10] BiocGenerics_0.4.0
loaded via a namespace (and not attached):
[1] BSgenome_1.26.1 Biostrings_2.26.3 DBI_0.2-5 MASS_7.3-23
[5] RColorBrewer_1.0-5 RCurl_1.95-4.1 RSQLite_0.11.2 Rsamtools_1.10.2
[9] XML_3.98-1.1 biomaRt_2.14.0 bitops_1.0-6 colorspace_1.2-4
[13] dichromat_2.0-0 digest_0.6.3 grid_2.15.2 gtable_0.1.2
[17] hwriter_1.3 labeling_0.2 munsell_0.4.2 parallel_2.15.2
[21] plyr_1.8 proto_0.3-10 reshape2_1.2.2 rtracklayer_1.18.1
[25] scales_0.2.3 statmod_1.4.17 stats4_2.15.2 tools_2.15.2
[29] zlibbioc_1.4.0
grep ENSMUSG00000027854 ../../data/gtf/mm10_ensGene.gtf
chr3 ensGene exon 102995728 102995967 . + . gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "1"; exon_id "ENSMUST00000029447.1"; gene_name "ENSMUSG00000027854";
chr3 ensGene CDS 102995809 102995967 . + 0 gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "1"; exon_id "ENSMUST00000029447.1"; gene_name "ENSMUSG00000027854";
chr3 ensGene exon 102996156 102996261 . + . gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "2"; exon_id "ENSMUST00000029447.2"; gene_name "ENSMUSG00000027854";
chr3 ensGene CDS 102996156 102996261 . + 0 gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "2"; exon_id "ENSMUST00000029447.2"; gene_name "ENSMUSG00000027854";
chr3 ensGene exon 102997243 102997385 . + . gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "3"; exon_id "ENSMUST00000029447.3"; gene_name "ENSMUSG00000027854";
chr3 ensGene CDS 102997243 102997385 . + 2 gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "3"; exon_id "ENSMUST00000029447.3"; gene_name "ENSMUSG00000027854";
chr3 ensGene exon 102998490 102998603 . + . gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "4"; exon_id "ENSMUST00000029447.4"; gene_name "ENSMUSG00000027854";
chr3 ensGene CDS 102998490 102998603 . + 0 gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "4"; exon_id "ENSMUST00000029447.4"; gene_name "ENSMUSG00000027854";
chr3 ensGene exon 103001708 103003914 . + . gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "5"; exon_id "ENSMUST00000029447.5"; gene_name "ENSMUSG00000027854";
chr3 ensGene CDS 103001708 103001806 . + 0 gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "5"; exon_id "ENSMUST00000029447.5"; gene_name "ENSMUSG00000027854";
chr3 ensGene start_codon 102995809 102995811 . + 0 gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "1"; exon_id "ENSMUST00000029447.1"; gene_name "ENSMUSG00000027854";
chr3 ensGene stop_codon 103001807 103001809 . + 0 gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000029447"; exon_number "1"; exon_id "ENSMUST00000029447.1"; gene_name "ENSMUSG00000027854";
chr3 ensGene exon 102995730 102997385 . + . gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000151065"; exon_number "1"; exon_id "ENSMUST00000151065.1"; gene_name "ENSMUSG00000027854";
chr3 ensGene exon 102998490 102999251 . + . gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000151065"; exon_number "2"; exon_id "ENSMUST00000151065.2"; gene_name "ENSMUSG00000027854";
chr3 ensGene exon 102995795 102995967 . + . gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000119450"; exon_number "1"; exon_id "ENSMUST00000119450.1"; gene_name "ENSMUSG00000027854";
chr3 ensGene CDS 102995809 102995967 . + 0 gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000119450"; exon_number "1"; exon_id "ENSMUST00000119450.1"; gene_name "ENSMUSG00000027854";
chr3 ensGene exon 102998490 102998603 . + . gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000119450"; exon_number "2"; exon_id "ENSMUST00000119450.2"; gene_name "ENSMUSG00000027854";
chr3 ensGene CDS 102998490 102998603 . + 0 gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000119450"; exon_number "2"; exon_id "ENSMUST00000119450.2"; gene_name "ENSMUSG00000027854";
chr3 ensGene exon 103001708 103002194 . + . gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000119450"; exon_number "3"; exon_id "ENSMUST00000119450.3"; gene_name "ENSMUSG00000027854";
chr3 ensGene CDS 103001708 103001806 . + 0 gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000119450"; exon_number "3"; exon_id "ENSMUST00000119450.3"; gene_name "ENSMUSG00000027854";
chr3 ensGene start_codon 102995809 102995811 . + 0 gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000119450"; exon_number "1"; exon_id "ENSMUST00000119450.1"; gene_name "ENSMUSG00000027854";
chr3 ensGene stop_codon 103001807 103001809 . + 0 gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000119450"; exon_number "1"; exon_id "ENSMUST00000119450.1"; gene_name "ENSMUSG00000027854";
chr3 ensGene exon 102996049 102996261 . + . gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000137332"; exon_number "1"; exon_id "ENSMUST00000137332.1"; gene_name "ENSMUSG00000027854";
chr3 ensGene exon 102997243 102997351 . + . gene_id "ENSMUSG00000027854"; transcript_id "ENSMUST00000137332"; exon_number "2"; exon_id "ENSMUST00000137332.2"; gene_name "ENSMUSG00000027854";
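As a cross-check of the GTF against the GFF, here is a minimal Python sketch that cuts the transcript exons listed above at every exon start and end, so that no two resulting parts overlap. This is my own illustration and not the actual code of dexseq_prepare_annotation.py; `exons` and `disjoint_parts` are names I made up, and the coordinates are copied from the "exon" lines above (CDS and codon lines are ignored).

```python
# Exon records (start, end, transcript) for ENSMUSG00000027854,
# copied from the "exon" lines of the GTF above (1-based, inclusive).
exons = [
    (102995728, 102995967, "ENSMUST00000029447"),
    (102996156, 102996261, "ENSMUST00000029447"),
    (102997243, 102997385, "ENSMUST00000029447"),
    (102998490, 102998603, "ENSMUST00000029447"),
    (103001708, 103003914, "ENSMUST00000029447"),
    (102995730, 102997385, "ENSMUST00000151065"),
    (102998490, 102999251, "ENSMUST00000151065"),
    (102995795, 102995967, "ENSMUST00000119450"),
    (102998490, 102998603, "ENSMUST00000119450"),
    (103001708, 103002194, "ENSMUST00000119450"),
    (102996049, 102996261, "ENSMUST00000137332"),
    (102997243, 102997351, "ENSMUST00000137332"),
]


def disjoint_parts(exons):
    """Cut exons at every exon start/end so that no two parts overlap."""
    # Every exon start and every (end + 1) is a cut point (half-open intervals).
    cuts = sorted({s for s, _, _ in exons} | {e + 1 for _, e, _ in exons})
    parts = []
    for left, right in zip(cuts, cuts[1:]):
        # Transcripts with an exon spanning the whole interval [left, right - 1].
        covering = sorted({t for s, e, t in exons if s <= left and right - 1 <= e})
        if covering:  # skip intronic gaps not covered by any exon
            parts.append((left, right - 1, covering))
    return parts


for number, (start, end, transcripts) in enumerate(disjoint_parts(exons), start=1):
    print(f"{number:03d}", start, end, "+".join(transcripts))
```

Run on these exons, the sketch yields 13 non-overlapping parts with the same coordinates as the exonic_part lines in the GFF, so the GFF looks like a boundary-by-boundary split of the GTF exons rather than a merge of overlapping exons.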
--
António Miguel de Jesus Domingues, PhD
Postdoctoral researcher
Deep Sequencing Group - SFB655
Biotechnology Center (Biotec)
Technische Universität Dresden
Fetscherstraße 105
01307 Dresden
Phone: +49 (351) 458 82362
Email: antonio.domingues(at)biotec.tu-dresden.de
--
The Unbearable Lightness of Molecular Biology
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Internal_tranbscript.pdf
Type: application/pdf
Size: 8751 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/bioconductor/attachments/20140206/c58f158d/attachment.pdf>