[BioC] Bug in makeOrgPackageFromNCBI from AnnotationForge?
Blanchette, Marco
MAB at stowers.org
Sat Aug 24 04:24:30 CEST 2013
I am working on a project that uses Schizosaccharomyces pombe as a source for genomic analysis, and I would love to use the ReportingTools HTML-producing wrappers. However, I am struggling because there is no AnnotationDbi package available for this organism. I decided to finally take the plunge and see if I could build one myself using AnnotationForge, and was quite excited to find that I could perhaps make one simply by using makeOrgPackageFromNCBI(). Unfortunately, something went wrong, and I suspect a bug somewhere in the pipeline. I have not dug any deeper than trying to build the package and use it, hoping that someone closer to the code could shed some light. Here are the steps I took:
> library(AnnotationForge)
> makeOrgPackageFromNCBI(version = "0.1",
                         author = "Marco Blanchette <mab at stowers.org>",
                         maintainer = "Marco Blanchette <mab at stowers.org>",
                         outputDir = ".",
                         tax_id = "4896",
                         genus = "Schizosaccharomyces",
                         species = "pombe")
This step succeeded with only a warning:
Warning message:
In .makeSimpleTable(ug, table = "unigene", con) :
no values found for table unigene in this data chunk.
I didn't think this was critical enough to raise a red flag, so I proceeded with the installation, which went smoothly:
> library(devtools)
> install('org.Spombe.eg.db')
> library('org.Spombe.eg.db')
I then tried to use it with the ReportingTools publish() function, but it failed with an error related to Entrez IDs, for which I had a conversion table from biomaRt. I dug a bit deeper and found that none of the genes I was querying were in the database, and finally realized that there were only 38 entries in the org.Spombe.eg.db database I had just created and installed. Check this out:
> keytypes(org.Spombe.eg.db)
[1] "ENTREZID" "ACCNUM" "ALIAS" "CHR" "PMID" "REFSEQ"
[7] "SYMBOL" "UNIGENE" "GENENAME" "GO" "EVIDENCE" "ONTOLOGY"
Looking good! However:
> length(keys(org.Spombe.eg.db,'ENTREZID'))
[1] 38
Can someone close enough to the code shed some light as to whether there is a bug in AnnotationForge, or whether it is the NCBI database that does not conform to what is expected? For instance, biomaRt has 5117 Entrez IDs:
> library(biomaRt)
> mart <- useMart("fungi_mart_18","spombe_eg_gene")
> ensembl2entrez <- getBM(c('ensembl_gene_id','entrezgene'),mart=mart)
> sum(!is.na(ensembl2entrez$entrezgene))
[1] 5117
The IDs I tested on the NCBI website return the correct genes. However, only 10 of the AnnotationForge Entrez IDs (out of a meager 38) are found in biomaRt:
> sum(keys(org.Spombe.eg.db,'ENTREZID') %in% ensembl2entrez$entrezgene)
[1] 10
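To see which IDs fall on each side of that comparison, a set-based breakdown along the following lines might help. This is only a sketch: the two vectors are made-up placeholders standing in for keys(org.Spombe.eg.db,'ENTREZID') and ensembl2entrez$entrezgene, and the variable names are hypothetical.

```r
# Hypothetical placeholder IDs (not real data). In the actual session these
# would come from keys(org.Spombe.eg.db, 'ENTREZID') and
# ensembl2entrez$entrezgene respectively.
forge_ids   <- c("101", "102", "103", "104")
biomart_ids <- c(103, 104, 105, NA)

# Coerce both sides to character and drop NAs so the comparison does not
# depend on one side being numeric and the other character.
forge_ids   <- as.character(forge_ids)
biomart_ids <- as.character(biomart_ids[!is.na(biomart_ids)])

shared       <- intersect(forge_ids, biomart_ids)  # present in both sources
only_forge   <- setdiff(forge_ids, biomart_ids)    # only in the org package
only_biomart <- setdiff(biomart_ids, forge_ids)    # only in biomaRt

print(list(shared = shared,
           only_forge = only_forge,
           only_biomart = only_biomart))
```

Looking up a few of the IDs that appear on only one side at NCBI would show whether the two sources are annotating the same set of gene records.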
Again, I would appreciate any comments or suggestions as to whether this is a bug, something I did wrong, or a misalignment between the NCBI S. pombe annotation and what AnnotationForge expects.
Thanks
--
Marco Blanchette, Ph.D.
Assistant Investigator
Stowers Institute for Medical Research
1000 East 50th St.
Kansas City, MO 64110
Tel: 816-926-4071
Cell: 816-726-8419
Fax: 816-926-2018