[Bioc-devel] RFC: Naming scheme for organism level annotation data packages

Seth Falcon sfalcon at fhcrc.org
Mon Jul 23 16:15:10 CEST 2007

Hi Sean,

Sean Davis <sdavis2 at mail.nih.gov> writes:
> Since Seth et al. have produced a wonderfully useful db-based system,
> it seems that these data packages could be much more flexible from an
> ID point of view.  One has a primary ID associated with the data
> package, but mappings, to the extent that they are available, could
> also be included.  Then, you could have something like:
> primaryKey(org.Hs.mappings)
> [1] "EntrezGene"
> availableKeys(org.Hs.mappings)
>     KeyType        ExampleValue
> [1] "EntrezGene"   9923
> [2] "EnsemblGene"  ENSG00000273213
> [3] "HUGOSymbol"   BRCA1

Interesting.  Although I can see how this would work from a DB point
of view, it isn't clear to me that such a combined package would be
feasible/desirable.  If the IDs are more or less different names for
the same things, then no problem.  But if a new ID induces an entirely
new mapping of all the downstream relations, well, the resulting DB
size could be prohibitive.
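
To make the "different names for the same things" case concrete, here is
a minimal sketch in Python using the standard sqlite3 module. It is not
AnnotationDbi code; the table names (genes, gene_go, id_alias), the
helper functions, and the sample rows are all assumptions chosen for
illustration. It shows that when a new ID type is just an alias for the
primary key, adding it only adds rows to one alias table and leaves the
downstream tables untouched.

```python
import sqlite3

# Hypothetical schema: Entrez Gene is the primary key, downstream
# annotation hangs off it, and other ID types live in one alias table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE genes    (entrez_id TEXT PRIMARY KEY);
CREATE TABLE gene_go  (entrez_id TEXT REFERENCES genes(entrez_id),
                       go_id     TEXT);
CREATE TABLE id_alias (key_type  TEXT,
                       value     TEXT,
                       entrez_id TEXT REFERENCES genes(entrez_id));
""")

# Illustrative rows only.
conn.execute("INSERT INTO genes VALUES ('672')")
conn.execute("INSERT INTO gene_go VALUES ('672', 'GO:0006281')")


def downstream_rows(conn):
    """Count rows in the downstream annotation table."""
    return conn.execute("SELECT COUNT(*) FROM gene_go").fetchone()[0]


def add_alias(conn, key_type, value, entrez_id):
    """Register one secondary ID as another name for a primary key."""
    conn.execute("INSERT INTO id_alias VALUES (?, ?, ?)",
                 (key_type, value, entrez_id))


def lookup_go(conn, key_type, value):
    """Resolve a secondary ID to the primary key, then fetch GO IDs."""
    rows = conn.execute(
        """SELECT g.go_id
             FROM id_alias a
             JOIN gene_go g ON g.entrez_id = a.entrez_id
            WHERE a.key_type = ? AND a.value = ?
            ORDER BY g.go_id""",
        (key_type, value)).fetchall()
    return [r[0] for r in rows]


before = downstream_rows(conn)
add_alias(conn, "HUGOSymbol", "BRCA1", "672")
add_alias(conn, "EnsemblGene", "ENSG00000012048", "672")
after = downstream_rows(conn)  # unchanged: only id_alias grew
```

If instead each ID type required its own version of gene_go (because the
IDs do not line up one-to-one), every downstream table would have to be
duplicated per ID type, which is where the size concern comes from.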

Your pseudocode suggests the notion of a package-level object
"org.Hs.mappings".  That isn't something we've implemented in
AnnotationDbi, but I like the idea.

I'd like to point out that we have a number of the SQLite-based
annotation data packages available in devel and this would be a great
time for interested parties to give them a try and send us feedback.

The packages should work as drop-in replacements for the
environment-based packages.  There are some additional features which
currently are only documented in the AnnotationDbi vignette.

> The reasons that I like this approach are:
> 1) Each organism package then need be created only once and the
> expectation would be that most of the appropriate mappings would be
> included.

It seems to me that this only works if the IDs are nearly equivalent.
If not, each "primary ID" needs to be deeply involved in the process
of creating the DB tables.

> 2) Standardizes mappings between ID types--individual users can rely
> on a standard mapping with version information (Nothing worse than an
> external mapping source "updating" halfway through a project)
> 3) Allows one "pipeline" for the production of the annotation and
> primary keys, while allowing flexibility in the production of
> secondary mappings (an arbitrary number of mappings can be added; one
> could even imagine allowing users to add their own mappings quite
> easily to the database with a single function)
> 4) Software immediately becomes more useful without much increased
> complexity
> 5) Could be extended to have multiple primary keytypes in the same
> data package with automatic key conversions.

Let me know if I'm misunderstanding, but here I think you are
describing a system that would define a mapping, say, from Ensembl to
EG, and it isn't clear to me that this is what someone wanting Ensembl
annotation would really want -- it would allow them to work with
Ensembl IDs, but using EG annotation.
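
A small Python sketch of what I mean, under stated assumptions: the two
dictionaries and the function name annotate_by_ensembl are hypothetical,
the rows are illustrative, and none of this is existing AnnotationDbi
functionality. The lookup accepts an Ensembl ID, but everything it
returns comes from the EG side, and an Ensembl ID with no EG counterpart
gets no annotation at all.

```python
# Hypothetical conversion table and EG-keyed annotation (illustrative rows).
ENSEMBL_TO_EG = {"ENSG00000012048": "672"}
EG_ANNOTATION = {"672": {"symbol": "BRCA1", "source": "EntrezGene"}}


def annotate_by_ensembl(ensembl_id):
    """Look up annotation for an Ensembl ID by converting it to EG first.

    Returns None when the Ensembl ID has no EG counterpart, because the
    annotation itself is stored only under EG keys.
    """
    eg_id = ENSEMBL_TO_EG.get(ensembl_id)
    if eg_id is None:
        return None
    record = EG_ANNOTATION.get(eg_id)
    if record is None:
        return None
    # Copy the EG record and note which ID the user actually asked with.
    return dict(record, queried_as=ensembl_id)


mapped = annotate_by_ensembl("ENSG00000012048")
unmapped = annotate_by_ensembl("ENSG_PLACEHOLDER_WITHOUT_EG")  # made-up ID
```

Here mapped carries source "EntrezGene" even though the query used an
Ensembl ID, and unmapped is None.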

Best Wishes,

+ seth

Seth Falcon | Computational Biology | Fred Hutchinson Cancer Research Center
