[BioC] DEXSeq- dexseq_prepare_annotation.py error
    Anitha Sundararajan 
    asundara at ncgr.org
       
    Wed Apr  9 19:50:34 CEST 2014
    
    
  
Hi,
I am trying to run dexseq_prepare_annotation.py on a GTF file (Ensembl 
version) downloaded from iGenomes.  The organism is Arabidopsis 
thaliana.  Every run fails with the following error:
Traceback (most recent call last):
  File "/home/as/R/x86_64-unknown-linux-gnu-library/3.0/DEXSeq/python_scripts/dexseq_prepare_annotation.py", line 51, in <module>
    for f in HTSeq.GFF_Reader( gtf_file ):
  File "/sw/python/2.7.1/lib/python2.7/site-packages/HTSeq-0.5.4p3-py2.7-linux-x86_64.egg/HTSeq/__init__.py", line 221, in __iter__
    ( attr, name ) = parse_GFF_attribute_string( attributeStr, True )
  File "/sw/python/2.7.1/lib/python2.7/site-packages/HTSeq-0.5.4p3-py2.7-linux-x86_64.egg/HTSeq/__init__.py", line 174, in parse_GFF_attribute_string
    raise ValueError, "The attribute string seems to contain mismatched quotes."
ValueError: The attribute string seems to contain mismatched quotes.
The command I used is:
/home/as/R/x86_64-unknown-linux-gnu-library/3.0/DEXSeq/python_scripts/dexseq_prepare_annotation.py genes.gtf genes.flattened.gff
I ran the same script on other GTF files from the same database 
(human, Drosophila) and it works fine on those. Their GTF files 
look comparable to this one, at a glance anyway.  Any help will be appreciated.
A few lines from the GTF file I'm using:
1       protein_coding  exon    3631    3913    .       +       .       exon_number "1"; gene_id "AT1G01010"; gene_name "ANAC001"; p_id "P20332"; seqedit "false"; transcript_id "AT1G01010.1"; transcript_name "AT1G01010.1"; tss_id "TSS22545";
1       protein_coding  CDS     3760    3913    .       +       0       exon_number "1"; gene_id "AT1G01010"; gene_name "ANAC001"; p_id "P20332"; protein_id "AT1G01010.1"; transcript_id "AT1G01010.1"; transcript_name "AT1G01010.1"; tss_id "TSS22545";
1       protein_coding  start_codon     3760    3762    .       +       0       exon_number "1"; gene_id "AT1G01010"; gene_name "ANAC001"; p_id "P20332"; transcript_id "AT1G01010.1"; transcript_name "AT1G01010.1"; tss_id "TSS22545";
1       protein_coding  CDS     3996    4276    .       +       2       exon_number "2"; gene_id "AT1G01010"; gene_name "ANAC001"; p_id "P20332"; protein_id "AT1G01010.1"; transcript_id "AT1G01010.1"; transcript_name "AT1G01010.1"; tss_id "TSS22545";
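The lines above look balanced, so the offending record is presumably somewhere else in the file. A rough way to locate it, sketched here under assumptions: count the double quotes in the attribute column of each record and report the records where the count is odd. This only approximates the check that raises the ValueError; it is not HTSeq's own parser, and the function name find_odd_quote_lines is made up for illustration.

```python
import sys


def find_odd_quote_lines(handle):
    """Yield (line_number, line) for records whose attribute column
    contains an odd number of double quotes.

    Hypothetical helper for illustration; not part of HTSeq or DEXSeq.
    """
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        # Skip blank lines and comment/header lines.
        if not line or line.startswith("#"):
            continue
        # GTF has 9 tab-separated columns; the attributes are the 9th.
        fields = line.split("\t", 8)
        # If the line is not tab-separated as expected, check all of it.
        attributes = fields[8] if len(fields) == 9 else line
        if attributes.count('"') % 2 != 0:
            yield line_number, line


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage (script name is arbitrary): python check_quotes.py genes.gtf
    with open(sys.argv[1]) as gtf:
        for number, record in find_odd_quote_lines(gtf):
            print("%d: %s" % (number, record))
```

Any line it prints is a candidate for the record the parser is rejecting. A record with an escaped or embedded quote could still slip through, since only the raw count is checked.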
I can send the complete file if need be.
Thanks so much for your help.
Anitha Sundararajan
    
    