[BioC] DEXSeq - dexseq_prepare_annotation.py error

Anitha Sundararajan asundara at ncgr.org
Wed Apr 9 19:50:34 CEST 2014


Hi

I am trying to run dexseq_prepare_annotation.py on a GTF file (Ensembl 
version) downloaded from iGenomes.  The organism is Arabidopsis 
thaliana.  Every run fails with the following error:

Traceback (most recent call last):
  File "/home/as/R/x86_64-unknown-linux-gnu-library/3.0/DEXSeq/python_scripts/dexseq_prepare_annotation.py", line 51, in <module>
    for f in HTSeq.GFF_Reader( gtf_file ):
  File "/sw/python/2.7.1/lib/python2.7/site-packages/HTSeq-0.5.4p3-py2.7-linux-x86_64.egg/HTSeq/__init__.py", line 221, in __iter__
    ( attr, name ) = parse_GFF_attribute_string( attributeStr, True )
  File "/sw/python/2.7.1/lib/python2.7/site-packages/HTSeq-0.5.4p3-py2.7-linux-x86_64.egg/HTSeq/__init__.py", line 174, in parse_GFF_attribute_string
    raise ValueError, "The attribute string seems to contain mismatched quotes."
ValueError: The attribute string seems to contain mismatched quotes.

The command I used is:

/home/as/R/x86_64-unknown-linux-gnu-library/3.0/DEXSeq/python_scripts/dexseq_prepare_annotation.py genes.gtf genes.flattened.gff

I ran the same script on other GTF files in the database (human, 
Drosophila) and it works fine, even though those GTF files look 
comparable to this one (at a glance, anyway).  Any help will be appreciated.


A few lines from the GTF file I'm using:

1       protein_coding  exon    3631    3913    .       +       .       exon_number "1"; gene_id "AT1G01010"; gene_name "ANAC001"; p_id "P20332"; seqedit "false"; transcript_id "AT1G01010.1"; transcript_name "AT1G01010.1"; tss_id "TSS22545";
1       protein_coding  CDS     3760    3913    .       +       0       exon_number "1"; gene_id "AT1G01010"; gene_name "ANAC001"; p_id "P20332"; protein_id "AT1G01010.1"; transcript_id "AT1G01010.1"; transcript_name "AT1G01010.1"; tss_id "TSS22545";
1       protein_coding  start_codon     3760    3762    .       +       0       exon_number "1"; gene_id "AT1G01010"; gene_name "ANAC001"; p_id "P20332"; transcript_id "AT1G01010.1"; transcript_name "AT1G01010.1"; tss_id "TSS22545";
1       protein_coding  CDS     3996    4276    .       +       2       exon_number "2"; gene_id "AT1G01010"; gene_name "ANAC001"; p_id "P20332"; protein_id "AT1G01010.1"; transcript_id "AT1G01010.1"; transcript_name "AT1G01010.1"; tss_id "TSS22545";

I can send the complete file if need be.
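
The lines shown above appear to have balanced quotes, so the offending 
record is presumably somewhere else in the file.  Below is a minimal sketch 
for locating it.  It assumes the error is triggered by an attribute whose 
double-quote count is neither 0 nor 2; that is only my reading of the error 
message, not a copy of the HTSeq check, and the function names 
(split_outside_quotes, suspicious_attributes, scan_gtf) are hypothetical 
helpers, not part of HTSeq or DEXSeq.

```python
import sys


def split_outside_quotes(text, sep=";"):
    """Split text on sep, ignoring separators that fall inside double quotes."""
    fields, current, in_quotes = [], [], False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes
        if ch == sep and not in_quotes:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


def suspicious_attributes(attribute_column):
    """Return the attribute fields whose double-quote count is neither 0 nor 2.

    Assumption: this approximates the condition behind the
    "mismatched quotes" error; it is not the HTSeq implementation.
    """
    return [field.strip()
            for field in split_outside_quotes(attribute_column)
            if field.strip() and field.count('"') not in (0, 2)]


def scan_gtf(path):
    """Yield (line_number, bad_fields) for every suspicious record in a GTF file."""
    with open(path) as handle:
        for number, line in enumerate(handle, start=1):
            if line.startswith("#"):
                continue  # skip comment lines
            columns = line.rstrip("\n").split("\t")
            if len(columns) < 9:
                continue  # not a complete 9-column GTF record
            bad = suspicious_attributes(columns[8])
            if bad:
                yield number, bad


if __name__ == "__main__":
    # Usage: python find_bad_quotes.py genes.gtf
    if len(sys.argv) > 1:
        for number, bad in scan_gtf(sys.argv[1]):
            print("line %d: %s" % (number, "; ".join(bad)))
```

Run against genes.gtf, this should print the line number and the suspect 
attribute for each record that such a check would reject, which makes it 
easier to see whether a stray quote or an embedded semicolon is involved.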

Thanks so much for your help.

Anitha Sundararajan


