[Bioc-devel] RFC: Naming scheme for organism level annotation data packages
sdavis2 at mail.nih.gov
Sat Jul 21 18:58:09 CEST 2007
Seth Falcon wrote:
> Wolfgang Huber <huber at ebi.ac.uk> writes:
>> Hi Seth,
>> sounds good to me.
>> One possible option I wanted to throw into the ring to solve the
>> identifier system problem and at the same time be at least conceptually
>> prepared for annotations of multi-species systems (e.g. host-pathogen,
>> say, man/anopheles/plasmodium) would be to use the name of the
>> identifier system (EG) as the prefix rather than "org".
> That was something we discussed. The down sides of that are:
> - What would you put for an updated version of the YEAST package?
> - How would you identify organism-level packages? [Perhaps your
> point is that this may not really be all that useful so isn't
> worth considering].
Since Seth et al. have produced a wonderfully useful db-based system, it
seems that these data packages could be much more flexible from an ID
point of view. One has a primary ID associated with the data package,
but mappings, to the extent that they are available, could also be
included. Then, you could have something like:
 "EntrezGene" 9923
 "EnsemblGene" ENSG00000273213
 "HUGOSymbol" BRCA1
And tools for getting data:
mget(mykeys, org.Hs.mappingsSYMBOL) # expects mykeys to be EntrezGene
mget(mykeys, org.Hs.mappingsSYMBOL, keytype="EnsemblGene")
# does a lookup of EnsemblGene to EntrezGene and then does the mget;
# under the hood, this is a simple join in SQL
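
To make the proposed call a bit more concrete, here is a rough sketch in base R of what a keytype-aware lookup could look like. Base R's mget() has no keytype argument, so the wrapper mgetByKeytype, the environments symbolEnv and ensembl2eg, and the keymaps argument are all hypothetical names made up for illustration. The IDs are the placeholder values from the example above, not verified cross-references.

```r
# Hypothetical stand-ins for an annotation package's environments.
# IDs are the placeholder values from this post, not verified cross-references.
symbolEnv <- new.env()                        # primary key (EntrezGene) -> symbol
assign("9923", "BRCA1", envir = symbolEnv)

ensembl2eg <- new.env()                       # EnsemblGene -> EntrezGene
assign("ENSG00000273213", "9923", envir = ensembl2eg)

# Hypothetical wrapper: translate keys to the primary key type, then mget().
mgetByKeytype <- function(keys, envir, keytype = "EntrezGene",
                          keymaps = list(EnsemblGene = ensembl2eg)) {
  if (keytype == "EntrezGene") {
    primary <- keys
  } else {
    if (is.null(keymaps[[keytype]])) stop("unknown keytype: ", keytype)
    # First hop: secondary ID -> primary ID (NA when there is no mapping).
    primary <- unname(unlist(mget(keys, envir = keymaps[[keytype]],
                                  ifnotfound = NA_character_)))
  }
  # Start with NA for every key, then fill in the ones that could be mapped.
  result <- rep(list(NA_character_), length(keys))
  names(result) <- keys                       # report under the caller's own keys
  found <- !is.na(primary)
  result[found] <- mget(primary[found], envir = envir,
                        ifnotfound = NA_character_)
  result
}

mgetByKeytype("9923", symbolEnv)
mgetByKeytype("ENSG00000273213", symbolEnv, keytype = "EnsemblGene")
```

Both calls return a list named by the keys the caller passed in, so downstream code never has to see the primary key unless it wants to.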
Software using such annotation packages automatically becomes hugely
more powerful. Alternatively, a MAPPINGENVIRONMENT could be included
that could do the up-front mapping from one ID type to the primary key
(and back again) and then software could remain largely unchanged from
the current situation (assuming there is a single primary key).
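
As a sketch of that alternative, the up-front mapping and the mapping back could be as simple as the following. Named character vectors stand in for the mapping environment here, and toPrimary, fromPrimary, and legacySymbolLookup are hypothetical names; the IDs are again the placeholders from the example above.

```r
# Hypothetical two-way map between a secondary ID type and the primary key.
toPrimary   <- c(ENSG00000273213 = "9923")            # EnsemblGene -> EntrezGene
fromPrimary <- setNames(names(toPrimary), toPrimary)  # EntrezGene -> EnsemblGene

# Stand-in for existing software that only understands the primary key.
# It is left untouched by the key conversion.
legacySymbolLookup <- function(egIds) {
  symbols <- c("9923" = "BRCA1")
  unname(symbols[egIds])
}

userKeys <- "ENSG00000273213"
egKeys   <- unname(toPrimary[userKeys])          # map to the primary key up front
symbols  <- legacySymbolLookup(egKeys)           # unchanged software runs as before
names(symbols) <- unname(fromPrimary[egKeys])    # map back for reporting
symbols
```

The conversion happens entirely outside the existing lookup, which is the reason this route would need fewer changes to current software.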
The reasons that I like this approach are:
1) Each organism package then need be created only once and the
expectation would be that most of the appropriate mappings would be
2) Standardizes mappings between ID types--individual users can rely on
a standard mapping with version information (Nothing worse than an
external mapping source "updating" halfway through a project)
3) Allows one "pipeline" for the production of the annotation and
primary keys, while allowing flexibility in the production of secondary
mappings (an arbitrary number of mappings can be added; one could even
imagine allowing users to add their own mappings quite easily to the
database with a single function)
4) Software immediately becomes more useful without much increased
5) Could be extended to have multiple primary keytypes in the same data
package with automatic key conversions.
Of course, some attention would need to be paid to documenting the
source of the alternative mappings, but the alternative mappings are
readily available. With the adoption of a SQL backend for these
packages, all of this becomes doable with the addition of a single table
(or two, if one stores the "availableKey" information in a separate
table, which is a good idea in my opinion) and some infrastructure for doing
the lookups (API-level infrastructure, since the backend is a simple join).
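
For what it is worth, here is one way the two-table layout might look, sketched with base R data frames and merge() standing in for the SQL tables and the join. Everything here is an assumption for illustration: the table and column names (idMap, availableKeys, annotation, origin), the functions lookupByKeytype and addMapping, and the "placeholder" values. The IDs are the placeholders used earlier in this message.

```r
# Hypothetical in-memory stand-in for the package database.
db <- list(
  # One row per (primary ID, key type, alternative ID).
  idMap = data.frame(
    primary_id = c("9923", "9923"),
    keytype    = c("EnsemblGene", "HUGOSymbol"),
    alt_id     = c("ENSG00000273213", "BRCA1"),
    stringsAsFactors = FALSE
  ),
  # The separate "availableKey" table: which key types exist and where they came from.
  availableKeys = data.frame(
    keytype = c("EnsemblGene", "HUGOSymbol"),
    origin  = c("placeholder source", "placeholder source"),
    stringsAsFactors = FALSE
  ),
  # Annotation keyed by the primary ID only.
  annotation = data.frame(
    primary_id  = "9923",
    description = "example annotation",
    stringsAsFactors = FALSE
  )
)

# Lookup by any registered key type: filter the mapping table, then join.
lookupByKeytype <- function(db, keys, keytype) {
  if (!keytype %in% db$availableKeys$keytype) stop("unknown keytype: ", keytype)
  wanted <- db$idMap[db$idMap$keytype == keytype & db$idMap$alt_id %in% keys, ]
  merge(wanted, db$annotation, by = "primary_id")   # the "simple join"
}

# Single function for adding a user-supplied mapping; returns the updated db.
addMapping <- function(db, primaryIds, altIds, keytype, origin) {
  db$idMap <- rbind(db$idMap,
                    data.frame(primary_id = primaryIds, keytype = keytype,
                               alt_id = altIds, stringsAsFactors = FALSE))
  db$availableKeys <- rbind(db$availableKeys,
                            data.frame(keytype = keytype, origin = origin,
                                       stringsAsFactors = FALSE))
  db
}

lookupByKeytype(db, "ENSG00000273213", "EnsemblGene")
db2 <- addMapping(db, "9923", "myLabId1", "UserID", "user supplied")
lookupByKeytype(db2, "myLabId1", "UserID")
```

Keeping the origin of each key type in its own table is also a natural place to record the source and version of each alternative mapping.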
All of this said, I am not intimately involved enough to know how much
work this would actually entail, but since we are talking about making
changes, I think it is worthwhile to entertain various options.