[BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria genomes
Thomas Girke
thomas.girke at ucr.edu
Sat Jun 8 03:16:47 CEST 2013
No worries about the timing. I just wanted to point out that it would be
worthwhile to address this some time in the future since it applies to a
huge number of sequenced genomes. I totally understand the challenge Marc is
facing supporting all these inconsistent data formats.
Thomas
On Fri, Jun 07, 2013 at 11:55:34PM +0000, Hervé Pagès wrote:
> Hi Thomas, Malcolm,
>
> This GFF3 file doesn't pass validation using the online validator at:
>
> http://modencode.oicr.on.ca/cgi-bin/validate_gff3_online
>
> but for a minor problem that doesn't have anything to do with what is
> being discussed here (still, it's sad that an organization like NCBI
> distributes invalid GFF3 files :-/ ).
>
> Anyway you're right Thomas that it would probably make sense to
> be able to import those files with makeTranscriptDbFromGFF().
> Don't know how hard that will be though. Marc being on vacation right
> now, this won't be addressed until 2 weeks from now.
>
> I don't have an easy workaround to offer in the mean time, sorry...
> but maybe someone has?
>
> Cheers,
> H.
>
>
> On 06/07/2013 04:33 PM, Thomas Girke wrote:
> > The perl -> genbank -> gff -> R -> txdb approach seems to work in
> > R-2.15.1 but not in R-3.0.0 anymore. The latter returns
> >
> > txdb <- makeTranscriptDbFromGFF(file="NC_011025.gbk.gff", format="gff3", dataSource="NCBI", species="Some bact")
> > extracting transcript information
> > Error in .prepareGFF3TXS(data) : Unexpected transcript duplicates
> >
> > Given that there are over 2500 bacteria genomes annotated this way at NCBI,
> > a more generic solution to this problem would be very useful, I think.
> >
> > Thomas
> >
> >
> > On Fri, Jun 07, 2013 at 07:56:32PM +0000, Cook, Malcolm wrote:
> >> FYI, bioperl includes bp_genbank2gff3.pl
> >>
> >> which when run as
> >>
> >>> bp_genbank2gff3.pl NC_011025.gbk
> >>
> >> produces NC_011025.gbk.gff (attached)
> >>
> >> which loaded without error with transcript:
> >>
> >>> txdb <- makeTranscriptDbFromGFF(file="NC_011025.gbk.gff", format="gff3", dataSource="NCBI", species="Some bact")
> >> extracting transcript information
> >> Extracting gene IDs
> >> extracting transcript information
> >> Processing splicing information for gff3 file.
> >> Deducing exon rank from relative coordinates provided
> >> Prepare the 'metadata' data frame ... metadata: OK
> >> Now generating chrominfo from available sequence names. No chromosome length information is available.
> >> Warning messages:
> >> 1: In .deduceExonRankings(exs, format = "gff") :
> >> Infering Exon Rankings. If this is not what you expected, then please be sure that you have provided a valid attribute for exonRankAttributeName
> >> 2: In matchCircularity(chroms, circ_seqs) :
> >> None of the strings in your circ_seqs argument match your seqnames.
> >>> txdb
> >> TranscriptDb object:
> >> | Db type: TranscriptDb
> >> | Supporting package: GenomicFeatures
> >> | Data source: NCBI
> >> | Genus and Species: Some bact
> >> | miRBase build ID: NA
> >> | transcript_nrow: 631
> >> | exon_nrow: 631
> >> | cds_nrow: 631
> >> | Db created by: GenomicFeatures package from Bioconductor
> >> | Creation time: 2013-06-07 14:52:50 -0500 (Fri, 07 Jun 2013)
> >> | GenomicFeatures version at creation time: 1.10.2
> >> | RSQLite version at creation time: 0.11.2
> >> | DBSCHEMAVERSION: 1.0
> >>
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Cook, Malcolm
> >> Sent: Friday, June 07, 2013 2:32 PM
> >> To: 'Thomas Girke'; 'bioconductor at r-project.org'
> >> Cc: 'Brandon Gallaher'
> >> Subject: Re: [BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria genomes
> >>
> >> Taking a quick look at the GFF in question ... I don't see any mRNA features.... they appear to be implicit .... which is not well formed GFF3 (c.f. http://www.sequenceontology.org/gff3.shtml)
> >>
> >> That is your first problem.
> >>
> >> In particular, the first gene in a file by itself is rejected. Adding the mRNA and exon lines as below, and the first gene is now accepted by makeTranscriptDbFromGFF.
> >>
> >> ##gff-version 3
> >> #!gff-spec-version 1.20
> >> #!processor NCBI annotwriter
> >> ##sequence-region NC_011025.1 1 820453
> >> ##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=243272
> >> NC_011025.1 RefSeq region 1 820453 . + . ID=id0;Dbxref=taxon:243272;Is_circular=true;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=158L3-1
> >> NC_011025.1 RefSeq gene 107 1471 . + . ID=gene0;Name=dnaA;Dbxref=GeneID:6418131;gbkey=Gene;gene=dnaA;locus_tag=MARTH_orf001
> >> NC_011025.1 RefSeq CDS 107 1471 . + 0 ID=cds0;Name=YP_001999673.1;Parent=gene0;Note=binds to the dnaA-box as an ATP-bound complex at the origin of replication during the initiation of chromosomal replication%3B can also affect transcription of multiple genes including itself.;Dbxref=Genbank:YP_001999673.1,GeneID:6418131;gbkey=CDS;product=chromosomal replication initiation protein;protein_id=YP_001999673.1;transl_table=4
> >> NC_011025.1 RefSeq mRNA 107 1471 . + 0 ID=mRNA0;Parent=gene0
> >> NC_011025.1 RefSeq exon 107 1471 . + 0 ID=exon0;Parent=mRNA0
> >>
> >>
> >> The question is what to do.
> >>
> >> Not sure.
> >>
> >> Any other help?
> >>
> >> Good luck,
> >>
> >> ~Malcolm
> >>
> >>
> >> -----Original Message-----
> >> From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Thomas Girke
> >> Sent: Friday, June 07, 2013 12:52 PM
> >> To: bioconductor at r-project.org
> >> Cc: Brandon Gallaher
> >> Subject: [BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria genomes
> >>
> >> It seems to me that makeTranscriptDbFromGFF does not yet work on the
> >> bacteria GFFs from NCBI (perhaps others too):
> >> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/
> >>
> >> ## For instance, the following
> >> library(GenomicFeatures)
> >> download.file("ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Mycoplasma_arthritidis_158L3_1_uid58005/NC_011025.gff", destfile="NC_011025.gff")
> >> txdb <- makeTranscriptDbFromGFF(file="NC_011025.gff", format="gff3", dataSource="NCBI", species="Some bact")
> >>
> >> ## returns this error:
> >> extracting transcript information
> >> Error in .prepareGFF3TXS(data) :
> >> No Transcript information present in gff file
> >>
> >> I guess this is because in bacteria GFF we don't have explicit
> >> transcript annotations. There are hacks around this problem, but it
> >> would be nice if this could be supported in the future right out of the
> >> box. I apologize if I missed an existing solution for this.
> >>
> >> Best,
> >>
> >> Thomas
> >>
> >>> sessionInfo()
> >> R version 3.0.0 (2013-04-03)
> >> Platform: x86_64-unknown-linux-gnu (64-bit)
> >>
> >> locale:
> >> [1] C
> >>
> >> attached base packages:
> >> [1] parallel stats graphics utils datasets grDevices methods base
> >>
> >> other attached packages:
> >> [1] GenomicFeatures_1.12.1 AnnotationDbi_1.22.0 Biobase_2.20.0 rtracklayer_1.20.1 GenomicRanges_1.12.0 IRanges_1.18.0 BiocGenerics_0.6.0
> >>
> >> loaded via a namespace (and not attached):
> >> [1] BSgenome_1.28.0 Biostrings_2.28.0 DBI_0.2-5 RCurl_1.95-4.1 RSQLite_0.11.2 Rsamtools_1.12.0 XML_3.96-1.1 biomaRt_2.16.0 bitops_1.0-5 stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0
> >>
> >> _______________________________________________
> >> Bioconductor mailing list
> >> Bioconductor at r-project.org
> >> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>
> >> _______________________________________________
> >> Bioconductor mailing list
> >> Bioconductor at r-project.org
> >> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at r-project.org
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> >
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
More information about the Bioconductor
mailing list