[BioC] SQLForge and probes that map to multiple genes

Wed Jul 16 19:11:53 CEST 2008

Mark Cowley wrote:
> Hi Marc, Sean and list.
>
> If I can follow up on Marc's comment:
> "The thing that has me scratching my head is why you would want to map 
> multiple genes onto a single probe in your annotation package?"
>
> The genomics annotation problem (what does this ProbeSet detect, and 
> which ProbeSets detect my gene of interest) is inherently many to 
> many, that is, one ProbeSet can map to many 'genes' (or at least many 
> different accessions that point to the same gene), and that 1 'gene' 
> can map to multiple ProbeSets (perhaps different isoforms).
>
> Does SQLForge handle these inevitable situations nicely?
> Having read the SQLForge pdf documentation, and this post, it seems 
> that you can only provide at most 2 accessions for each ProbeSet, 
> perhaps a RefSeq accession, and if that is not known, a GenBank 
> accession.
>
> If this has been discussed elsewhere, can someone please point me in 
> the right direction?
>
> Cheers,
>
> Mark
> -----------------------------------------------------
> Mark Cowley, BSc (Bioinformatics)(Hons)
>
> Peter Wills Bioinformatics Centre
> Garvan Institute of Medical Research, Sydney, Australia
> -----------------------------------------------------
> On 15/07/2008, at 6:57 AM, Marc Carlson wrote:
>
Hi Mark,

In its current form, SQLForge takes as many IDs as you want to give it, 
but it assumes that you intended to assign only one gene to a given 
probe.  That is, it assumes that the probe was designed to measure one 
thing.  All of us who make annotation packages understand that in 
practice this does not always work out as intended.  What confused me 
was why you would want to deal with ambiguous probes by creating an 
ambiguous database.  It seems to me that it may be better not to make a 
gene assignment at all if you don't know what your probe is measuring.  
If a probe is known to bind to more than one target, then the 
interpretation of any measurement from that probe becomes very 
speculative, since you have no way of knowing what proportion of the 
signal belongs to which target.  I agree with Sean that in a rare case 
like this you will want to look at a recent BLAST alignment for your 
mystery probe.  But since such a probe is (ultimately) a mystery, I am 
hesitant to assign multiple identities to it inside an annotation 
package...
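
As a rough illustration of the "leave ambiguous probes unassigned" idea, 
here is a minimal Python sketch.  It is not part of SQLForge; the 
function name split_probes, the probe IDs and the gene IDs are all made 
up for the example.  It assumes you already have (probe, gene) pairs 
from somewhere and simply separates probes that resolve to exactly one 
gene from those that resolve to several.

```python
from collections import defaultdict


def split_probes(pairs):
    """Split (probe_id, gene_id) pairs into unambiguous and ambiguous probes.

    Hypothetical helper for illustration only; not a SQLForge function.
    """
    # Collect the distinct genes seen for each probe. Several accessions
    # that resolve to the same gene collapse into a single entry.
    genes_by_probe = defaultdict(set)
    for probe_id, gene_id in pairs:
        genes_by_probe[probe_id].add(gene_id)

    unambiguous = {}  # probe -> its single gene
    ambiguous = {}    # probe -> sorted list of candidate genes
    for probe_id, genes in genes_by_probe.items():
        if len(genes) == 1:
            unambiguous[probe_id] = next(iter(genes))
        else:
            ambiguous[probe_id] = sorted(genes)
    return unambiguous, ambiguous


# Made-up input: probe_1 appears twice but always points at geneA,
# probe_2 also detects geneA, and probe_3 hits two different genes.
pairs = [
    ("probe_1", "geneA"),
    ("probe_1", "geneA"),
    ("probe_2", "geneA"),
    ("probe_3", "geneB"),
    ("probe_3", "geneC"),
]
unambiguous, ambiguous = split_probes(pairs)
```

Note that a gene detected by several probes (geneA above) is kept; only 
probes with more than one candidate gene are set aside, so they can be 
checked by alignment rather than written into the annotation.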

Just to clarify, SQLForge is not limited to two kinds of IDs for 
mapping.  One of the parameters (otherSrc) takes a vector of filenames, 
so you can pass several different mappings into that parameter at once 
if desired.  Many major ID types are supported as a way to tell SQLForge 
which gene to assign, but once it has an assignment it retrieves all the 
remaining data for the database from public sources.  So your mapping 
files are just a hook that lets SQLForge find the rest of the 
information.  In most cases, your initial mapping will probably be 
complete enough to make the extra data passed through otherSrc 
redundant.
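
If you want to check whether extra mapping files would add anything 
before passing them in, something along these lines may help.  This is 
only a Python sketch under my own assumptions: the function 
added_by_extra_sources and all of the IDs are hypothetical, each mapping 
is assumed to be already loaded as a probe -> accession dictionary, and 
it says nothing about how SQLForge itself resolves the files.

```python
def added_by_extra_sources(base_mapping, extra_mappings):
    """Report probes covered only by the extra sources.

    base_mapping:   dict of probe_id -> accession (the initial mapping)
    extra_mappings: dict of source_name -> dict of probe_id -> accession
    Hypothetical helper for illustration only; not a SQLForge function.
    """
    added = {}
    for source_name, mapping in extra_mappings.items():
        for probe_id, accession in mapping.items():
            # A probe already present in the base mapping gains nothing
            # from the extra file, so only record the ones it lacks.
            if probe_id not in base_mapping:
                added.setdefault(probe_id, {})[source_name] = accession
    return added


# Made-up data: the base mapping covers probe_1 and probe_2; only the
# second extra source contributes a probe the base mapping lacks.
base = {"probe_1": "NM_000001", "probe_2": "NM_000002"}
extras = {
    "extra_file_1": {"probe_1": "X00001"},
    "extra_file_2": {"probe_2": "X00002", "probe_3": "X00003"},
}
added = added_by_extra_sources(base, extras)
```

An empty result would suggest the extra files are redundant for that 
chip; here only probe_3 is new.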

I hope this clarifies things,

  Marc