[BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria genomes

Cook, Malcolm MEC at stowers.org
Fri Jun 7 21:32:08 CEST 2013


Taking a quick look at the GFF in question ... I don't see any mRNA features....  they appear to be implicit .... which is not well formed GFF3 (c.f. http://www.sequenceontology.org/gff3.shtml)

That is your first problem.

In particular, the first gene in a file by itself is rejected.  Adding the mRNA and exon lines as below, and the first gene is now accepted by makeTranscriptDbFromGFF.

##gff-version 3
#!gff-spec-version 1.20
#!processor NCBI annotwriter
##sequence-region NC_011025.1 1 820453
##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=243272
NC_011025.1	RefSeq	region	1	820453	.	+	.	ID=id0;Dbxref=taxon:243272;Is_circular=true;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=158L3-1
NC_011025.1	RefSeq	gene	107	1471	.	+	.	ID=gene0;Name=dnaA;Dbxref=GeneID:6418131;gbkey=Gene;gene=dnaA;locus_tag=MARTH_orf001
NC_011025.1	RefSeq	CDS	107	1471	.	+	0	ID=cds0;Name=YP_001999673.1;Parent=gene0;Note=binds to the dnaA-box as an ATP-bound complex at the origin of replication during the initiation of chromosomal replication%3B can also affect transcription of multiple genes including itself.;Dbxref=Genbank:YP_001999673.1,GeneID:6418131;gbkey=CDS;product=chromosomal replication initiation protein;protein_id=YP_001999673.1;transl_table=4
NC_011025.1	RefSeq	mRNA	107	1471	.	+	0	ID=mRNA0;Parent=gene0
NC_011025.1	RefSeq	exon	107	1471	.	+	0	ID=exon0;Parent=mRNA0


The question is what to do.

Not sure.

Any other help?

Good luck,

~Malcolm


-----Original Message-----
From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Thomas Girke
Sent: Friday, June 07, 2013 12:52 PM
To: bioconductor at r-project.org
Cc: Brandon Gallaher
Subject: [BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria genomes

It seems to me that makeTranscriptDbFromGFF does not yet work on the 
bacteria GFFs from NCBI (perhaps others too): 
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/

## For instance, the following 
library(GenomicFeatures)
download.file("ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Mycoplasma_arthritidis_158L3_1_uid58005/NC_011025.gff", destfile="NC_011025.gff")
txdb <- makeTranscriptDbFromGFF(file="NC_011025.gff", format="gff3", dataSource="NCBI", species="Some bact")
       
## returns this error:
extracting transcript information
Error in .prepareGFF3TXS(data) :
  No Transcript information present in gff file

I guess this is because in bacteria GFF we don't have explicit
transcript annotations. There are hacks around this problem, but it
would be nice if this could be supported in the future right out of the
box. I apologize if I missed an existing solution for this.

Best,

Thomas

> sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] C

attached base packages:
[1] parallel  stats     graphics  utils     datasets  grDevices methods   base

other attached packages:
[1] GenomicFeatures_1.12.1 AnnotationDbi_1.22.0   Biobase_2.20.0         rtracklayer_1.20.1     GenomicRanges_1.12.0   IRanges_1.18.0         BiocGenerics_0.6.0

loaded via a namespace (and not attached):
 [1] BSgenome_1.28.0   Biostrings_2.28.0 DBI_0.2-5         RCurl_1.95-4.1    RSQLite_0.11.2    Rsamtools_1.12.0  XML_3.96-1.1      biomaRt_2.16.0    bitops_1.0-5      stats4_3.0.0      tools_3.0.0       zlibbioc_1.6.0

_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list