[Bioc-devel] Known Genes replaced by GENCODE genes at UCSC

Mon Jan 11 19:40:32 CET 2016

hi,

On 01/11/2016 04:07 PM, Vincent Carey wrote:
[...]
> Is it true that there is an asymmetry between Entrez gene ID and Ensembl
> gene ID for querying org.Hs.eg.db (I tend to prefer Homo.sapiens
> as a symbol mapping resource)?  Both ENTREZID and ENSEMBL are listed as
> keytypes.  My question is whether this "anchor" concept
> holds in the current infrastructure.

you're right that the infrastructure is probably symmetric, at least 
between Entrez and Ensembl, so maybe i'm not using the term "anchor" 
correctly here. i'm just referring to the fact that many package 
functions and use cases of BioC are based on, or illustrated using, 
Entrez IDs. examples are:

head(org.Hs.eg.db::keys(org.Hs.eg.db))
[1] "1"  "2"  "3"  "9"  "10" "11"

i.e., by default the 'keytype' is 'ENTREZID'
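
the same accessors also take a non-default keytype, which is the 
symmetry mentioned above. a minimal sketch, assuming org.Hs.eg.db and 
AnnotationDbi are installed (the variable names 'has_orgdb', 'db' and 
'ens' are only for illustration):

```r
# Sketch: query org.Hs.eg.db with Ensembl gene IDs instead of the default.
# The block is skipped when org.Hs.eg.db is not installed.
has_orgdb <- requireNamespace("org.Hs.eg.db", quietly = TRUE)

if (has_orgdb) {
  db <- org.Hs.eg.db::org.Hs.eg.db

  # keys() defaults to ENTREZID; pass 'keytype' to get Ensembl gene IDs
  ens <- head(AnnotationDbi::keys(db, keytype = "ENSEMBL"))

  # select() accepts the same keytype, so lookups can start from Ensembl
  print(AnnotationDbi::select(db, keys = ens, keytype = "ENSEMBL",
                              columns = c("ENTREZID", "SYMBOL")))
}
```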

genefilter::nsFilter() argument 'require.entrez' filters out features 
without an Entrez Gene ID annotation.

Category::categoryToEntrezBuilder() returns a list mapping category ids 
to the Entrez Gene ids annotated at the category id.

SummarizedExperiment::geneRangeMapper() takes a 'TxDb' object and a 
keytype to map ranges to genes. By default the keytype is 'ENTREZID'

some of the workflows are also based on Entrez IDs, such as:

http://www.bioconductor.org/help/workflows/annotation/Annotation_Resources

http://www.bioconductor.org/help/workflows/variants

so if the user just replaces the txdb object in one of those examples or 
function arguments by a txdb object that does not have Entrez 
identifiers as primary gene key, those functions, examples or workflows 
will require modification. this is not necessarily bad, but may put more 
burden on the user who is learning with a "default" TxDb human gene 
annotation package. so far this has been the *.UCSC.knownGene package, 
which uses Entrez as gene identifiers. given that UCSC appears to be 
discontinuing the known gene track, i would suggest making another 
default gene annotation package available at the BioC site, but one 
based on Entrez identifiers, given the amount of legacy code and 
documentation using Entrez in one way or another.

an alternative to translating the default Ensembl Gencode identifiers 
into Entrez would be to just take the NCBI RefSeq annotations as the 
human gene annotation package available by default, i.e., replacing the 
current *.UCSC.knownGene by *.UCSC.refGene
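
for reference, the translation route could look roughly like the sketch 
below. it assumes org.Hs.eg.db and AnnotationDbi are installed, and the 
helper names 'strip_version' and 'ensembl_to_entrez' are hypothetical, 
not part of any package. GENCODE gene ids carry a version suffix (e.g. 
".11") that has to be dropped before lookup, and taking only the first 
match per id is a simplification:

```r
# Sketch: translate GENCODE/Ensembl gene IDs into Entrez Gene IDs.

# Drop the trailing ".<version>" that GENCODE appends to Ensembl gene IDs.
strip_version <- function(ids) {
  sub("\\.[0-9]+$", "", ids)
}

# Map Ensembl gene IDs to Entrez Gene IDs through an OrgDb object.
# multiVals = "first" keeps a single Entrez ID per Ensembl ID; IDs
# without a match come back as NA.
ensembl_to_entrez <- function(ensembl_ids, orgdb) {
  AnnotationDbi::mapIds(orgdb,
                        keys = strip_version(ensembl_ids),
                        column = "ENTREZID",
                        keytype = "ENSEMBL",
                        multiVals = "first")
}

# Usage, only attempted when the annotation package is available.
if (requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  print(ensembl_to_entrez("ENSG00000141510.11",
                          org.Hs.eg.db::org.Hs.eg.db))
}
```

the one-to-many cases are where this gets lossy, which is part of why 
a RefSeq-based package may be the simpler default.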

robert.