[Bioc-devel] RFC: Naming scheme for organism level annotation data packages

Sat Jul 21 18:58:09 CEST 2007

Seth Falcon wrote:
> Wolfgang Huber <huber at ebi.ac.uk> writes:
>
>   
>> Hi Seth,
>>
>> sounds good to me.
>>
>> One possible option I wanted to throw into the ring to solve the
>> identifier system problem and at the same time be at least conceptually
>> prepared for annotations of multi-species systems (e.g. host-pathogen,
>> say, man/anopheles/plasmodium) would be to use the name of the
>> identifier system (EG) as the prefix rather than "org".
>>     
>
> That was something we discussed.  The down sides of that are:
>
>   - What would you put for an updated version of the YEAST package?
>
>   - How would you identify organism-level packages?  [Perhaps your
>     point is that this may not really be all that useful so isn't
>     worth considering].
>   

Since Seth et al. have produced a wonderfully useful db-based system, it 
seems that these data packages could be much more flexible from an ID 
point of view.  One has a primary ID associated with the data package, 
but mappings, to the extent that they are available, could also be 
included.  Then, you could have something like:

primaryKey(org.Hs.mappings)
[1] "EntrezGene"

availableKeys(org.Hs.mappings)
    KeyType         ExampleValue
[1] "EntrezGene"    9923
[2] "EnsemblGene"   ENSG00000273213
[3] "HUGOSymbol"    BRCA1

And tools for getting data:

# expects mykeys to be EntrezGene
mget(mykeys, org.Hs.mappingsSYMBOL)

# does lookup of EnsemblGene to EntrezGene and then does the mget;
# under the hood, this is a simple join in SQL
mget(mykeys, org.Hs.mappingsSYMBOL, keytype="EnsemblGene")
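
To make the "simple join" concrete, here is a minimal sketch of what the lookup could do underneath. It uses Python's sqlite3 module purely as a stand-in for the package's db layer; the table names (symbols, id_map), the function mget_symbol, and all IDs are made up for illustration and are not part of any existing package.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with hypothetical tables and IDs."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE symbols (entrez_id TEXT PRIMARY KEY, symbol TEXT);
        CREATE TABLE id_map  (keytype TEXT, foreign_id TEXT, entrez_id TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO symbols VALUES (?, ?)",
        [("1", "GENE_A"), ("2", "GENE_B")],
    )
    conn.executemany(
        "INSERT INTO id_map VALUES (?, ?, ?)",
        [("EnsemblGene", "ENS_A", "1"), ("EnsemblGene", "ENS_B", "2")],
    )
    return conn


def mget_symbol(conn, keys, keytype="EntrezGene"):
    """Return {key: symbol or None}; non-primary keytypes go through a join."""
    result = {}
    for key in keys:
        if keytype == "EntrezGene":
            # Primary key: look the symbol up directly.
            row = conn.execute(
                "SELECT symbol FROM symbols WHERE entrez_id = ?", (key,)
            ).fetchone()
        else:
            # Secondary key: join the mapping table to the symbol table.
            row = conn.execute(
                """
                SELECT s.symbol
                FROM id_map AS m
                JOIN symbols AS s ON s.entrez_id = m.entrez_id
                WHERE m.keytype = ? AND m.foreign_id = ?
                """,
                (keytype, key),
            ).fetchone()
        result[key] = row[0] if row else None
    return result
```

Calling mget_symbol(conn, ["ENS_A"], keytype="EnsemblGene") on the demo database goes through the join; a key with no mapping simply comes back as None.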

Software using such annotation packages automatically becomes hugely 
more powerful.  Alternatively, a MAPPINGENVIRONMENT could be included 
that could do the up-front mapping from one ID type to the primary key 
(and back again) and then software could remain largely unchanged from 
the current situation (assuming there is a single primary key).
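
A rough sketch of that alternative, again in Python only for illustration: the IDs are translated to the primary key up front, an unchanged lookup that only understands primary keys is called, and the results are re-keyed by the caller's original IDs. The names legacy_symbol_lookup and lookup_with_keytype, and the dictionaries standing in for the mapping environment, are hypothetical.

```python
def legacy_symbol_lookup(entrez_ids, symbols):
    """Stands in for existing software that only understands the primary key."""
    return {eid: symbols.get(eid) for eid in entrez_ids}


def lookup_with_keytype(keys, to_primary, symbols):
    """Map keys to the primary key, call the unchanged lookup, map back."""
    # Up-front mapping: foreign ID -> primary ID (unmapped keys are skipped).
    primary_ids = [to_primary[k] for k in keys if k in to_primary]
    by_primary = legacy_symbol_lookup(primary_ids, symbols)
    # "And back again": re-key the results by the IDs the caller supplied.
    return {k: by_primary.get(to_primary.get(k)) for k in keys}
```

The point of the design is that legacy_symbol_lookup never sees anything but primary keys, so it does not need to change.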

The reasons that I like this approach are:
1) Each organism package then need be created only once and the 
expectation would be that most of the appropriate mappings would be 
included.
2) It standardizes mappings between ID types--individual users can rely 
on a standard mapping with version information (nothing is worse than an 
external mapping source "updating" halfway through a project).
3) It allows one "pipeline" for the production of the annotation and 
primary keys, while allowing flexibility in the production of secondary 
mappings (an arbitrary number of mappings can be added; one could even 
imagine allowing users to add their own mappings quite easily to the 
database with a single function).
4) Software immediately becomes more useful without much increased 
complexity.
5) Could be extended to have multiple primary keytypes in the same data 
package with automatic key conversions.

Of course, some attention would need to be paid to documenting the 
source of the alternative mappings, but the alternative mappings are 
readily available.  With a SQL backend for these packages, all of this 
becomes doable by adding a single table (or two, if one stores the 
"availableKeys" information in a separate table--a good idea, in my 
opinion) plus some API-level infrastructure for doing the lookups (the 
backend itself is a simple join).
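
As a sketch of the two-table layout, assuming nothing about the real package schema: one table lists the available key types (with a column for documenting the source of each mapping), and a second holds the ID pairs. Python's sqlite3 is used only as a stand-in; the table and function names (available_keys, id_map, add_mapping, and so on) and the IDs are hypothetical.

```python
import sqlite3


def create_mapping_db(primary_keytype):
    """Create the two hypothetical tables and register the primary key type."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE available_keys (
            keytype    TEXT PRIMARY KEY,
            is_primary INTEGER NOT NULL,
            source     TEXT              -- where the mapping came from
        );
        CREATE TABLE id_map (
            keytype    TEXT NOT NULL REFERENCES available_keys(keytype),
            foreign_id TEXT NOT NULL,
            primary_id TEXT NOT NULL
        );
        """
    )
    conn.execute(
        "INSERT INTO available_keys VALUES (?, 1, NULL)", (primary_keytype,)
    )
    return conn


def add_mapping(conn, keytype, source, pairs):
    """Register a secondary key type and its (foreign_id, primary_id) pairs."""
    conn.execute(
        "INSERT INTO available_keys VALUES (?, 0, ?)", (keytype, source)
    )
    conn.executemany(
        "INSERT INTO id_map VALUES (?, ?, ?)",
        [(keytype, foreign_id, primary_id) for foreign_id, primary_id in pairs],
    )


def primary_key(conn):
    """Return the key type flagged as primary."""
    row = conn.execute(
        "SELECT keytype FROM available_keys WHERE is_primary = 1"
    ).fetchone()
    return row[0]


def available_keys(conn):
    """Return all key types, primary first, then alphabetically."""
    rows = conn.execute(
        "SELECT keytype FROM available_keys ORDER BY is_primary DESC, keytype"
    )
    return [r[0] for r in rows]
```

With this layout, adding a user-supplied mapping is one call to add_mapping, and the source column gives a place to record provenance and version.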

All of this said, I am not intimately involved enough to know how much 
work this would actually entail, but since we are talking about making 
changes, I think it is worth entertaining various options.

Sean