[Bioc-devel] Known Genes replaced by GENCODE genes at UCSC
Robert Castelo
robert.castelo at upf.edu
Mon Jan 11 19:40:32 CET 2016
hi,
On 01/11/2016 04:07 PM, Vincent Carey wrote:
[...]
> Is it true that there is an asymmetry between Entrez gene ID and Ensembl
> gene ID for querying org.Hs.eg.db (I tend to prefer Homo.sapiens
> as a symbol mapping resource)? Both ENTREZID and ENSEMBL are listed as
> keytypes. My question is whether this "anchor" concept
> holds in the current infrastructure.
you're right that the infrastructure is probably symmetric at least
between Entrez and Ensembl, so maybe i'm not using the term "anchor"
correctly here, i'm just referring to the fact that many package
functions and use cases of BioC are based on, or illustrated using,
Entrez IDs. examples are:
head(AnnotationDbi::keys(org.Hs.eg.db))
[1] "1" "2" "3" "9" "10" "11"
i.e., by default the 'keytype' is 'ENTREZID'
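as a minimal sketch of the symmetry mentioned above, assuming the org.Hs.eg.db package is installed (the code does nothing otherwise), the same accessors can be called with either keytype. the variable names 'default_keytype' and 'alt_keytype' are only illustrative:

```r
# Key types compared here; the names of these variables are illustrative.
default_keytype <- "ENTREZID"
alt_keytype <- "ENSEMBL"

# Only query the annotation database when the package is installed.
if (requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  db <- org.Hs.eg.db::org.Hs.eg.db

  # Both identifiers should be listed as valid key types.
  print(c(default_keytype, alt_keytype) %in% AnnotationDbi::keytypes(db))

  # Without 'keytype', keys() returns Entrez Gene IDs.
  print(head(AnnotationDbi::keys(db)))

  # Passing 'keytype' explicitly returns Ensembl gene IDs instead.
  print(head(AnnotationDbi::keys(db, keytype = alt_keytype)))

  # Mapping from Entrez to Ensembl IDs and gene symbols.
  print(AnnotationDbi::select(db, keys = c("1", "2"),
                              columns = c("ENSEMBL", "SYMBOL"),
                              keytype = default_keytype))
}
```

so the asymmetry is in the defaults rather than in what the infrastructure can do.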
genefilter::nsFilter() argument 'require.entrez' filters out features
without an Entrez Gene ID annotation.
Category::categoryToEntrezBuilder() returns a list mapping category ids
to the Entrez Gene ids annotated to each category id.
SummarizedExperiment::geneRangeMapper() takes a 'TxDb' object and a
keytype to map ranges to genes. By default the keytype is 'ENTREZID'
some of the workflows are also based on Entrez IDs, such as:
http://www.bioconductor.org/help/workflows/annotation/Annotation_Resources
http://www.bioconductor.org/help/workflows/variants
so if the user just replaces the txdb object in one of those examples or
function arguments with a txdb object that does not have Entrez
identifiers as primary gene key, those functions, examples or workflows
will require modification. this is not necessarily bad, but may put more
burden on the user who is learning with a "default" TxDb human gene
annotation package. so far this has been *.UCSC.knownGene, which uses
Entrez as gene identifiers. given that UCSC has apparently discontinued
the known gene track, i would suggest making another default gene
annotation package available at the BioC site, one that is again based
on Entrez identifiers, given the amount of legacy code and documentation
using Entrez in one way or another.
an alternative to translating the default Ensembl Gencode identifiers
into Entrez would be to just take the NCBI RefSeq annotations as the
human gene annotation package available by default, i.e., replacing the
current *.UCSC.knownGene with *.UCSC.refGene
robert.