[BioC] DEXSeq: problem with dexseq_prepare_annotation.py
Alejandro Reyes
alejandro.reyes at embl.de
Wed Apr 18 09:30:02 CEST 2012
Dear Stephen,
The problem is that the exon ends (148078883) before it starts
(148079388).
GTF files usually give the leftmost position as the start position,
independently of the strand. The UCSC annotations do so as well, with a
few exceptions such as the one you mentioned. The script complains
because the GTF parsers in HTSeq were written and tested for ENSEMBL
annotation files, where these kinds of "errors" are basically absent.
Assuming that these are simple mistakes in the GTF file, they are easy
to fix: either swap the start and end positions whenever start > end,
or delete the affected genes from the GTF file.
In any case, these cases are really rare, about 15 in the human UCSC genome.
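As a rough illustration of the first option, here is a minimal sketch that
swaps the two coordinates on any GTF line where start > end. It is not part
of DEXSeq or HTSeq; the function name fix_gtf_line and the command-line
usage are made up for this example, and it assumes the standard
tab-separated GTF layout with start and end in columns 4 and 5.

```python
import sys


def fix_gtf_line(line):
    """Return the GTF line with start and end swapped if start > end.

    Hypothetical helper for illustration only. Comment lines, blank lines
    and lines with fewer than 9 tab-separated fields are returned unchanged.
    """
    if line.startswith("#") or not line.strip():
        return line
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return line
    # GTF columns 4 and 5 (indices 3 and 4) hold start and end.
    start, end = int(fields[3]), int(fields[4])
    if start > end:
        fields[3], fields[4] = str(end), str(start)
    return "\t".join(fields) + "\n"


# Assumed usage: python fix_gtf.py genes.gtf genes.fixed.gtf
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as src, open(sys.argv[2], "w") as dst:
        for gtf_line in src:
            dst.write(fix_gtf_line(gtf_line))
```

The corrected file can then be passed to dexseq_prepare_annotation.py in
place of the original. It is worth looking at the swapped records by hand,
since with only a handful of them it is easy to check that they really are
simple coordinate mix-ups.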
Alejandro
> Alejandro, Simon, Wolfgang, et al.:
>
> I'm trying to use the dexseq_prepare_annotation.py script to parse the
> UCSC hg18 genes.gtf GTF file included with the Illumina igenomes
> packages (http://tophat.cbcb.umd.edu/igenomes.html). I'm getting an
> error:
>
> Traceback (most recent call last):
> File "/home/sdt5z/bin/dexseq_prepare_annotation.py", line 93, in <module>
> raise ValueError, "Same name found on two chromosomes: %s, %s" % (
> str(l[i]), str(l[i+1]) )
> ValueError: Same name found on two chromosomes: <GenomicFeature:
> exonic_part 'CFB' at chr6_qbl_hap2: 3167392 -> 3167602 (strand '+')>,
> <GenomicFeature: exonic_part 'CFB' at chr6_cox_hap1: 3359983 ->
> 3360325 (strand '+')>
>
> I'm guessing this is because the same gene name is found in two
> separate places. I'm not entirely sure what these two chromosomal
> segments refer to, but I removed them from the GTF file and the python
> script threw another error:
>
> Traceback (most recent call last):
> File "/home/sdt5z/bin/dexseq_prepare_annotation.py", line 91, in <module>
> assert l[i].iv.end <= l[i+1].iv.start, str(l[i+1]) + " starts too early"
> AssertionError: <GenomicFeature: exonic_part 'HIST2H3C+HIST2H3A' at
> chr1: 148079388 -> 148078883 (strand '-')> starts too early
>
> I'm really unsure what to make of this or how to fix it. The script
> works without issues with the Ensembl GTF. Any help would be greatly
> appreciated.
>
> Stephen
>
> -----------------------------------------
> Stephen D. Turner, Ph.D.
> Bioinformatics Core Director
> University of Virginia School of Medicine
> bioinformatics.virginia.edu