[BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria genomes
Cook, Malcolm
MEC at stowers.org
Fri Jun 7 21:32:08 CEST 2013
Taking a quick look at the GFF in question, I don't see any mRNA features. They appear to be implicit, which is not well-formed GFF3 (cf. http://www.sequenceontology.org/gff3.shtml).
That is your first problem.
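One quick way to confirm this is to tabulate the feature types in column 3 of the file. The following is only a minimal sketch in base R; `gff_feature_types` is a hypothetical helper name of mine, not part of GenomicFeatures, and the inline example lines are abbreviated versions of the records in this file.

```r
# Hypothetical helper: tabulate the feature types (column 3) of GFF3 lines.
gff_feature_types <- function(lines) {
  # Drop directive/comment lines (starting with '#') and blank lines.
  body <- lines[!grepl("^#", lines) & nzchar(lines)]
  # GFF3 records are tab-separated; the feature type is the third field.
  fields <- strsplit(body, "\t", fixed = TRUE)
  types <- vapply(fields, function(f) f[3], character(1))
  table(types)
}

# Tiny inline example with abbreviated attributes.
example <- c(
  "##gff-version 3",
  "NC_011025.1\tRefSeq\tgene\t107\t1471\t.\t+\t.\tID=gene0",
  "NC_011025.1\tRefSeq\tCDS\t107\t1471\t.\t+\t0\tID=cds0;Parent=gene0"
)
counts <- gff_feature_types(example)
print(counts)

# If "mRNA" is absent, the transcripts are only implicit.
has_mrna <- "mRNA" %in% names(counts)
```

On the real file this would be called as gff_feature_types(readLines("NC_011025.gff")), assuming the download from the original post succeeded.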
In particular, a file containing only the first gene is rejected. After adding the mRNA and exon lines shown below, that gene is accepted by makeTranscriptDbFromGFF.
##gff-version 3
#!gff-spec-version 1.20
#!processor NCBI annotwriter
##sequence-region NC_011025.1 1 820453
##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=243272
NC_011025.1 RefSeq region 1 820453 . + . ID=id0;Dbxref=taxon:243272;Is_circular=true;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=158L3-1
NC_011025.1 RefSeq gene 107 1471 . + . ID=gene0;Name=dnaA;Dbxref=GeneID:6418131;gbkey=Gene;gene=dnaA;locus_tag=MARTH_orf001
NC_011025.1 RefSeq CDS 107 1471 . + 0 ID=cds0;Name=YP_001999673.1;Parent=gene0;Note=binds to the dnaA-box as an ATP-bound complex at the origin of replication during the initiation of chromosomal replication%3B can also affect transcription of multiple genes including itself.;Dbxref=Genbank:YP_001999673.1,GeneID:6418131;gbkey=CDS;product=chromosomal replication initiation protein;protein_id=YP_001999673.1;transl_table=4
NC_011025.1 RefSeq mRNA 107 1471 . + 0 ID=mRNA0;Parent=gene0
NC_011025.1 RefSeq exon 107 1471 . + 0 ID=exon0;Parent=mRNA0
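The same hand edit could be applied to every gene in the file with a few lines of base R. This is only a sketch under assumptions I have not tested against makeTranscriptDbFromGFF: the helper name `add_implicit_transcripts` and the `mRNA_<gene>` / `exon_<gene>` ID scheme are mine, each gene is assumed to be a single unspliced transcript, and the new lines copy the gene's columns (so they carry "." in the phase column rather than the 0 in my hand-edited lines above). The CDS lines are left pointing at the gene, as in the example above.

```r
# Hypothetical helper: after every "gene" record, add an mRNA and an exon
# record spanning the same coordinates as the gene.
add_implicit_transcripts <- function(lines) {
  out <- vector("list", length(lines))
  for (i in seq_along(lines)) {
    ln <- lines[i]
    out[[i]] <- ln
    # Pass directives, comments, and blank lines through untouched.
    if (!nzchar(ln) || startsWith(ln, "#")) next
    f <- strsplit(ln, "\t", fixed = TRUE)[[1]]
    if (length(f) != 9 || f[3] != "gene") next
    # Find the gene's ID attribute in column 9.
    attrs <- strsplit(f[9], ";", fixed = TRUE)[[1]]
    id_attr <- attrs[startsWith(attrs, "ID=")]
    if (length(id_attr) != 1) next
    gene_id <- sub("^ID=", "", id_attr)
    mrna_id <- paste0("mRNA_", gene_id)  # assumed naming scheme
    # Copy the gene record, changing only the type and the attributes.
    mrna <- f
    mrna[3] <- "mRNA"
    mrna[9] <- paste0("ID=", mrna_id, ";Parent=", gene_id)
    exon <- f
    exon[3] <- "exon"
    exon[9] <- paste0("ID=exon_", gene_id, ";Parent=", mrna_id)
    out[[i]] <- c(ln,
                  paste(mrna, collapse = "\t"),
                  paste(exon, collapse = "\t"))
  }
  unlist(out, use.names = FALSE)
}

# Small inline example with abbreviated attributes.
gff <- c(
  "##gff-version 3",
  "NC_011025.1\tRefSeq\tgene\t107\t1471\t.\t+\t.\tID=gene0;Name=dnaA",
  "NC_011025.1\tRefSeq\tCDS\t107\t1471\t.\t+\t0\tID=cds0;Parent=gene0"
)
fixed <- add_implicit_transcripts(gff)
writeLines(fixed)
```

For the real file, one would read it with readLines("NC_011025.gff"), write the result with writeLines() to a new file, and pass that file to makeTranscriptDbFromGFF as in the original post. Note that here the new lines land directly after each gene rather than after the CDS; GFF3 does not fix the record order, but I have not checked whether the importer cares.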
The open question is how best to handle this in general.
I am not sure.
Can anyone else help?
Good luck,
~Malcolm
-----Original Message-----
From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Thomas Girke
Sent: Friday, June 07, 2013 12:52 PM
To: bioconductor at r-project.org
Cc: Brandon Gallaher
Subject: [BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria genomes
It seems to me that makeTranscriptDbFromGFF does not yet work on the
bacteria GFFs from NCBI (perhaps others too):
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/
## For instance, the following
library(GenomicFeatures)
download.file("ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Mycoplasma_arthritidis_158L3_1_uid58005/NC_011025.gff", destfile="NC_011025.gff")
txdb <- makeTranscriptDbFromGFF(file="NC_011025.gff", format="gff3", dataSource="NCBI", species="Some bact")
## returns this error:
extracting transcript information
Error in .prepareGFF3TXS(data) :
No Transcript information present in gff file
I guess this is because in bacteria GFF we don't have explicit
transcript annotations. There are hacks around this problem, but it
would be nice if this could be supported in the future right out of the
box. I apologize if I missed an existing solution for this.
Best,
Thomas
> sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] C
attached base packages:
[1] parallel stats graphics utils datasets grDevices methods base
other attached packages:
[1] GenomicFeatures_1.12.1 AnnotationDbi_1.22.0 Biobase_2.20.0 rtracklayer_1.20.1 GenomicRanges_1.12.0 IRanges_1.18.0 BiocGenerics_0.6.0
loaded via a namespace (and not attached):
[1] BSgenome_1.28.0 Biostrings_2.28.0 DBI_0.2-5 RCurl_1.95-4.1 RSQLite_0.11.2 Rsamtools_1.12.0 XML_3.96-1.1 biomaRt_2.16.0 bitops_1.0-5 stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0