[Bioc-devel] Known Genes replaced by GENCODE genes at UCSC

Vincent Carey stvjc at channing.harvard.edu
Mon Jan 11 19:57:07 CET 2016

I think these are all good observations and we may benefit from a wider
discussion on the support site?

the abandonment of knownGene seems to have clear implications for changing
our most visible txdb examples.  what should we change to?  can we make a
more future-proof design for these annotation selections?

On Mon, Jan 11, 2016 at 1:40 PM, Robert Castelo <robert.castelo at upf.edu> wrote:

> hi,
> On 01/11/2016 04:07 PM, Vincent Carey wrote:
> [...]
>> Is it true that there is an asymmetry between Entrez gene ID and Ensembl
>> gene ID for querying org.Hs.eg.db (I tend to prefer Homo.sapiens
>> as a symbol mapping resource)?  Both ENTREZID and ENSEMBL are listed as
>> keytypes.  My question is whether this "anchor" concept
>> holds in the current infrastructure.
>
> you're right that the infrastructure is probably symmetric, at least
> between Entrez and Ensembl, so maybe i'm not using the term "anchor"
> correctly here. i'm just referring to the fact that many package functions
> and use cases of BioC are based on, or illustrated with, Entrez IDs.
> examples are:
>
> head(org.Hs.eg.db::keys(org.Hs.eg.db))
> [1] "1"  "2"  "3"  "9"  "10" "11"
>
> i.e., by default the 'keytype' is 'ENTREZID'
>
> genefilter::nsFilter() argument 'require.entrez' filters out features
> without an Entrez Gene ID annotation.
>
> Category::categoryToEntrezBuilder() returns a list mapping category ids to
> the Entrez Gene ids annotated at each category id.
>
> SummarizedExperiment::geneRangeMapper() takes a 'TxDb' object and a
> keytype to map ranges to genes. by default the keytype is 'ENTREZID'.
>
> some of the workflows are also based on Entrez IDs, such as:
>
> http://www.bioconductor.org/help/workflows/annotation/Annotation_Resources
> http://www.bioconductor.org/help/workflows/variants
>
> so if the user just replaces the txdb object in one of those examples or
> function arguments with a txdb object that does not have Entrez identifiers
> as its primary gene key, those functions, examples or workflows will require
> modification. this is not necessarily bad, but may put more burden on the
> user who is learning with a "default" TxDb human gene annotation package.
> so far this has been *.UCSC.knownGene, which uses Entrez as gene identifiers.
>
> given that UCSC has apparently discontinued the known gene track, i would
> suggest making another default gene annotation package available at the
> BioC site, but one based on Entrez identifiers, given the amount of
> legacy code and documentation using Entrez in one way or another.
>
> an alternative to translating the default Ensembl Gencode identifiers into
> Entrez would be to just take the NCBI RefSeq annotations as the human gene
> annotation package available by default, i.e., replacing the current
> *.UCSC.knownGene with *.UCSC.refGene
>
> robert.
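
The Entrez/Ensembl symmetry question and the "translate Ensembl to Entrez"
option discussed above could be checked roughly as follows. This is only a
sketch: it assumes AnnotationDbi and org.Hs.eg.db are installed, and the
variable names (entrez, ens, ens_ids, back) are illustrative. The identifiers
returned will depend on the annotation release in use.

```r
# Sketch only: assumes the Bioconductor packages AnnotationDbi and
# org.Hs.eg.db are installed; results vary with the annotation release.
library(AnnotationDbi)
library(org.Hs.eg.db)

# Both identifier systems should be exposed as keytypes.
stopifnot(all(c("ENTREZID", "ENSEMBL") %in% keytypes(org.Hs.eg.db)))

# keys() without 'keytype' returns the package's central ID (Entrez here).
entrez <- head(keys(org.Hs.eg.db))

# Entrez -> Ensembl; multiVals = "list" keeps one-to-many mappings visible.
ens <- mapIds(org.Hs.eg.db, keys = entrez, column = "ENSEMBL",
              keytype = "ENTREZID", multiVals = "list")

# Ensembl -> Entrez, i.e. querying from the other side.
ens_ids <- unique(unlist(ens, use.names = FALSE))
ens_ids <- ens_ids[!is.na(ens_ids)]
back <- mapIds(org.Hs.eg.db, keys = ens_ids, column = "ENTREZID",
               keytype = "ENSEMBL", multiVals = "list")
```

Keeping multiVals = "list" rather than the default "first" makes it easier to
see where the two identifier systems do not map one-to-one, which is where a
switch of primary gene key would most likely affect existing examples.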
