[Bioc-devel] Question about org.Dr.eg.db package

James W. MacDonald
Thu Aug 13 23:36:20 CEST 2020


Hi Gennady,

That information should probably be cleaned up, and the BiMaps that point
to the location data removed. The OrgDbs do still contain position
information, but access to it has been deprecated, as you will see if you
query it using select():

> select(org.Dr.eg.db, "30037", "CHR")
'select()' returned 1:1 mapping between keys and columns
  ENTREZID CHR
1    30037   5
Warning message:
In .deprecatedColsMessage() :
  Accessing gene location information via 'CHR','CHRLOC','CHRLOCEND' is
  deprecated. Please use a range based accessor like genes(), or select()
  with columns values like TXCHROM and TXSTART on a TxDb or OrganismDb
  object instead.

The rationale being that the OrgDb packages are intended to contain
functional annotations, which are not based on any build, and instead are
current as of the construction of the OrgDb package. Since positional
information should be based on a genome release, those data have been
migrated to the TxDb and EnsDb packages, which are based on a given release.
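
As a minimal sketch of the range-based route the warning points to, the
following looks up the same gene in a TxDb. The choice of
TxDb.Drerio.UCSC.danRer11.refGene is an assumption made for illustration
(any zebrafish TxDb or EnsDb for the build you need would do), and the gene
may or may not be present in it, so no output is shown.

```r
# Assumption: this TxDb package is installed; it is used only as an example.
library(TxDb.Drerio.UCSC.danRer11.refGene)

txdb <- TxDb.Drerio.UCSC.danRer11.refGene

# The genome build the coordinates refer to is recorded on the object itself.
build <- unique(genome(txdb))
build

# genes() returns one range per gene; for a UCSC refGene-based TxDb the
# names are Entrez Gene IDs.
gene_ranges <- genes(txdb)

# Subset with %in% so that a gene absent from this TxDb gives an empty
# GRanges rather than an error.
hit <- gene_ranges[names(gene_ranges) %in% "30037"]
hit
```

Whatever coordinates come back are tied to the build reported by genome(),
which is exactly what the CHR/CHRLOC columns in the OrgDb cannot tell you.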

Put a different way, the data in an OrgDb package is downloaded from NCBI
as of a particular date, and the positional data we get are whatever we got
from NCBI on that date. This is obviously a problem for the positional
data, because what we get isn't necessarily tied to a stated genome build.
We get the TxDb data from the UCSC Genome Browser, which is build-specific,
so we can tell end users exactly what build the data come from. Ideally the
positional data would already be defunct in the OrgDb packages, but that
hasn't happened yet.
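
To see the difference for yourself, here is a hedged sketch that compares
the metadata of the two kinds of package. It again assumes that
TxDb.Drerio.UCSC.danRer11.refGene is installed, a package picked only as
an example of a build-specific TxDb.

```r
library(org.Dr.eg.db)
# Assumption: this TxDb package is installed; it is used only as an example.
library(TxDb.Drerio.UCSC.danRer11.refGene)

org_meta  <- metadata(org.Dr.eg.db)
txdb_meta <- metadata(TxDb.Drerio.UCSC.danRer11.refGene)

# OrgDb: the metadata rows are source names, URLs and download dates.
org_meta[grepl("SOURCEDATE", org_meta$name), ]

# TxDb: the metadata includes rows naming the data source and the genome build.
txdb_meta[txdb_meta$name %in% c("Data source", "Genome"), ]
```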

Best,

Jim



On Thu, Aug 13, 2020 at 4:39 PM Margolin, Gennady (NIH/NICHD) [C] via
Bioc-devel <bioc-devel using r-project.org> wrote:

> Hi Vincent,
>
> Thank you for responding.
>
> Here is an excerpt from the R help page for this package (I have version
> 3.10.0; I doubt anything changed in the latest one, which is 3.11.4):
>
> -------------------------------------------------
> org.Dr.egCHRLOC {org.Dr.eg.db}
> Entrez Gene IDs to Chromosomal Location
> Description
> org.Dr.egCHRLOC is an R object that maps entrez gene identifiers to the
> starting position of the gene. The position of a gene is measured as the
> number of base pairs.
> The CHRLOCEND mapping is the same as the CHRLOC mapping except that it
> specifies the ending base of a gene instead of the start.
> ……
> -------------------------------------------------
>
> This output also does not show any genome version:
> > org.Dr.eg_dbInfo()
>                  name  value
> 1     DBSCHEMAVERSION  2.1
> 2             Db type  OrgDb
> 3  Supporting package  AnnotationDbi
> 4            DBSCHEMA  ZEBRAFISH_DB
> 5            ORGANISM  Danio rerio
> 6             SPECIES  Zebrafish
> 7        EGSOURCEDATE  2019-Jul10
> 8        EGSOURCENAME  Entrez Gene
> 9         EGSOURCEURL  ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
> 10          CENTRALID  EG
> 11              TAXID  7955
> 12       GOSOURCENAME  Gene Ontology
> 13        GOSOURCEURL  ftp://ftp.geneontology.org/pub/go/godatabase/archive/latest-lite/
> 14       GOSOURCEDATE  2019-Jul10
> 15     GOEGSOURCEDATE  2019-Jul10
> 16     GOEGSOURCENAME  Entrez Gene
> 17      GOEGSOURCEURL  ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
> 18     KEGGSOURCENAME  KEGG GENOME
> 19      KEGGSOURCEURL  ftp://ftp.genome.jp/pub/kegg/genomes
> 20     KEGGSOURCEDATE  2011-Mar15
> 21       GPSOURCENAME  UCSC Genome Bioinformatics (Danio rerio)
> 22        GPSOURCEURL
> 23       GPSOURCEDATE  2017-Nov1
> 24       ENSOURCEDATE  2019-Jun24
> 25       ENSOURCENAME  Ensembl
> 26        ENSOURCEURL  ftp://ftp.ensembl.org/pub/current_fasta
> 27       UPSOURCENAME  Uniprot
> 28        UPSOURCEURL  http://www.UniProt.org/
> 29       UPSOURCEDATE  Mon Oct 21 14:32:30 2019
>
> From: Vincent Carey <stvjc using channing.harvard.edu>
> Date: Thursday, August 13, 2020 at 2:46 PM
> To: "Margolin, Gennady (NIH/NICHD) [C]" <gennady.margolin using nih.gov>
> Cc: "bioc-devel using r-project.org" <bioc-devel using r-project.org>
> Subject: Re: [Bioc-devel] Question about org.Dr.eg.db package
>
> This should probably be posed to the support site. What version of the
> package are you using? Where are you seeing coordinates? I would expect
> those to be obtained from the TxDb package, or perhaps from AnnotationHub.
>
> > columns(org.Dr.eg.db)
>  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
>  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"
> [11] "GO"           "GOALL"        "IPI"          "ONTOLOGY"     "ONTOLOGYALL"
> [16] "PATH"         "PFAM"         "PMID"         "PROSITE"      "REFSEQ"
> [21] "SYMBOL"       "UNIGENE"      "UNIPROT"      "ZFIN"
>
> On Thu, Aug 13, 2020 at 2:13 PM Margolin, Gennady (NIH/NICHD) [C] via
> Bioc-devel <bioc-devel using r-project.org> wrote:
> Hello,
>
> I have a short question: how do I figure out the genome version for the
> org.Dr.eg.db package? I couldn’t see it in the DESCRIPTION, and it’s also
> not in the org.Dr.eg_dbInfo() output. It would be nice to know if this is
> danRer11/GRCz11 or some other assembly, as there are coordinates present in
> the DB.
>
> Thank you,
> Gennady