[Bioc-devel] Weird monkey identifiers in org.Hs.eg.db

James W. MacDonald jmacdon sending from uw.edu
Tue Apr 23 17:03:02 CEST 2019


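It looks like the ensembl table of the human.db0 package got polluted with
*Pan troglodytes* (chimpanzee) genes. Counting rows by ID prefix shows 16207
chimp (ENSPTR) entries alongside 28973 human (ENSG) ones: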

> library(RSQLite)
> con <- dbConnect(SQLite(),
+     "/R-devel/lib64/R/library/human.db0/extdata/chipsrc_human.sqlite")
> dbGetQuery(con, "select count(*) from ensembl where ensid like 'ENSPTR%';")
  count(*)
1    16207
> dbGetQuery(con, "select count(*) from ensembl where ensid like 'ENSG%';")
  count(*)
1    28973
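
Until the source table is rebuilt, one possible stopgap is to filter on the
ID prefix on the user side. The following is only a rough sketch in base R,
not anything from AnnotationDbi: the helper names ensembl_prefix_counts() and
keep_human_ensembl() are made up for illustration, the toy data frame simply
mirrors the select() output quoted below, and it assumes human gene IDs are
exactly "ENSG" followed by digits.

```r
# Toy rows mirroring the select() output quoted below; not read from the package.
res <- data.frame(
  SYMBOL  = c("GCG", "GCG"),
  ENSEMBL = c("ENSPTRG00000000777", "ENSG00000115263"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: tabulate IDs by the letters that precede the digits,
# which makes foreign prefixes such as ENSPTRG easy to spot.
ensembl_prefix_counts <- function(ids) {
  table(sub("[0-9]+$", "", ids))
}

# Hypothetical helper: keep only rows whose ID looks like a human Ensembl gene
# (assumption: "ENSG" followed by digits only). Rows with NA IDs are dropped.
keep_human_ensembl <- function(df, col = "ENSEMBL") {
  df[grepl("^ENSG[0-9]+$", df[[col]]), , drop = FALSE]
}

ensembl_prefix_counts(res$ENSEMBL)
keep_human_ensembl(res)
```

The same filter could presumably be applied to the result of select() on
org.Hs.eg.db, but it only hides the stray rows; the table itself still needs
fixing.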

On Mon, Apr 22, 2019 at 11:54 PM Aaron Lun <
infinite.monkeys.with.keyboards using gmail.com> wrote:

> Playing around with org.Hs.eg.db 3.8.0. What on earth is ENSPTRG0000...?
>
>  > library(org.Hs.eg.db)
>  > mapIds(org.Hs.eg.db, key="GCG", keytype="SYMBOL", column="ENSEMBL")
> 'select()' returned 1:many mapping between keys and columns
>                   GCG
> "ENSPTRG00000000777"
>
> Well, at least it still recovers the right identifier... eventually.
>
>  > select(org.Hs.eg.db, key="GCG", keytype="SYMBOL", columns="ENSEMBL")
> 'select()' returned 1:many mapping between keys and columns
>    SYMBOL            ENSEMBL
> 1    GCG ENSPTRG00000000777
> 2    GCG    ENSG00000115263
>
> The SYMBOL->Entrez ID relational table seems to be okay:
>
>  > Y <- toTable(org.Hs.egSYMBOL)
>  > Y[which(Y[,2]=="GCG"),]
>       gene_id symbol
> 2152    2641    GCG
>
> So the cause is the Ensembl->Entrez mappings:
>
>  > Z <- toTable(org.Hs.egENSEMBL2EG)
>  > Z[Z[,1]==2641,]
>       gene_id         ensembl_id
> 3028    2641 ENSPTRG00000000777
> 3029    2641    ENSG00000115263
>
> Googling suggests that ENSPTRG00000000777 is an identifier for some
> other gene in one of the other monkeys. Hardly "Hs" stuff.
>
> Session info (not technically R 3.6, but I didn't think that would have
> been the cause):
>
> > R Under development (unstable) (2019-04-11 r76379)
> > Platform: x86_64-pc-linux-gnu (64-bit)
> > Running under: Ubuntu 18.04.2 LTS
> >
> > Matrix products: default
> > BLAS:   /home/luna/Software/R/trunk/lib/libRblas.so
> > LAPACK: /home/luna/Software/R/trunk/lib/libRlapack.so
> >
> > locale:
> >  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> >  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
> >  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
> >  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
> >  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> >
> > attached base packages:
> > [1] parallel  stats4    stats     graphics  grDevices utils     datasets
> > [8] methods   base
> >
> > other attached packages:
> > [1] org.Hs.eg.db_3.8.0   AnnotationDbi_1.45.1 IRanges_2.17.5
> > [4] S4Vectors_0.21.23    Biobase_2.43.1       BiocGenerics_0.29.2
> >
> > loaded via a namespace (and not attached):
> >  [1] Rcpp_1.0.1      digest_0.6.18   DBI_1.0.0       RSQLite_2.1.1
> >  [5] blob_1.1.1      bit64_0.9-7     bit_1.1-14      compiler_3.7.0
> >  [9] pkgconfig_2.0.2 memoise_1.1.0


-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
