[BioC] GOstats question

Wed Mar 30 17:19:24 CEST 2005

"David,

I think Robert answered this indirectly today for another post.  The 
BioConductor team maps based on ID matching in public databases. " 

I am new to the list and didn't see his posting -- 

"In 
order to be general, I think the mapping from genbank accession numbers 
to locuslink (Entrez Gene) is via Unigene.  A GenBank accession number 
is looked up in the Unigene database.  If found, the associated 
locuslink(s) are assigned to that probe.  Then, the information 
contained in locuslink (GO, KEGG, etc) is used to provide further 
annotation. " 
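For readers following the thread, the lookup chain described in the quote can be sketched in a few lines. This is only an illustration of the idea in Python, not the code the Bioconductor annotation packages actually run; the three lookup tables, every ID in them, and the function name annotate_probe are hypothetical placeholders.

```python
# Toy lookup tables. Every ID below is a made-up placeholder, not real data.
ACC_TO_UNIGENE = {
    "ACC0001": "Hs.1001",
    "ACC0002": "Hs.1001",
    "ACC0003": "Hs.2002",
}
UNIGENE_TO_LOCUSLINK = {
    "Hs.1001": ["LL100"],
    "Hs.2002": ["LL200", "LL201"],
}
LOCUSLINK_TO_GO = {
    "LL100": ["GO:0000001", "GO:0000002"],
    "LL200": ["GO:0000003"],
}


def annotate_probe(accession):
    """Follow GenBank accession -> UniGene cluster -> LocusLink ID(s) -> GO terms."""
    cluster = ACC_TO_UNIGENE.get(accession)
    if cluster is None:
        # Accession not found in UniGene: the probe stays unannotated.
        return {"locuslink": [], "go": []}
    ll_ids = UNIGENE_TO_LOCUSLINK.get(cluster, [])
    # Collect the GO terms of every associated LocusLink ID, without duplicates.
    go_terms = sorted({term for ll in ll_ids for term in LOCUSLINK_TO_GO.get(ll, [])})
    return {"locuslink": ll_ids, "go": go_terms}


print(annotate_probe("ACC0001"))
```

Note that nothing in this chain looks at the probe or target sequence, so two probe sets whose accessions fall in the same UniGene cluster end up with identical annotation.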

Even if the design (or the aim of the Bioconductor team) is limited to a
"general approach" which precludes working at the level of protein
product (or transcript) -- which is the basis of the GO annotation and
usually the goal of any test of GO category enrichment for a microarray
result -- for a given LL # we should still have all available GO terms
attributed, right? The example I gave showed that for at least two probe
sets (sharing the same LL #) this is not the case -- we have only 2 GO
terms to work with versus 12 (again using GOA as the reference) for a
well characterized gene.
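The kind of check I mean can be written as a simple set difference. This is a minimal sketch in Python; the function name missing_go_terms and all GO IDs are placeholders I made up, and only the counts (2 terms versus 12) come from the example above.

```python
def missing_go_terms(package_terms, reference_terms):
    """Return the reference GO terms that the annotation package does not list."""
    return sorted(set(reference_terms) - set(package_terms))


# Placeholder IDs only: 12 terms in the reference, 2 of them in the package.
reference = ["GO:%07d" % i for i in range(1, 13)]
package = reference[:2]

missing = missing_go_terms(package, reference)
print(len(package), "of", len(reference), "terms present;", len(missing), "missing")
```

Running the same comparison for every LL # in a list of interest would show whether the gap is specific to a few genes or systematic.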

"While there are other methods for annotating probesets (see the 
articles you cite above), they all require aligning target or probe 
sequences (also available from Affy) to known entities (like refseq, 
etc.) and is NOT what the BioConductor team attempts to do (and is a 
HUGE task to do well, having done this process for some long oligo 
arrays).  You could do this yourself, if necessary.  Also, you could 
look at Ensembl which does their own annotation of Affymetrix arrays.  
The downside of doing these things yourself (or not using the 
annotation packages provided by bioconductor) is that you then need to 
either modify the nice functions from the bioconductor project to use 
your own data or you need to make your data conform to the structures 
needed for the functions to work (which as you point out, in this case, 
will not suffice)."

It looks like that is what it takes to get to the core of the problem --
one of my aims (I am sure like many using Affy data) is to
summarize/study lists of probe sets derived from some test at the level
of GO terms. Therefore it is almost intuitive that the key to that aim
is to resolve both the multiplicity issues (many probe sets to one
protein product, somewhat addressed in the GOstats package -- at the
level of LocusLink) as well as the splice variant issues -- otherwise,
it seems that analyses will always stay at a "general" level.
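To make the multiplicity point concrete: if several probe sets for one gene are counted separately, a GO category can look enriched just because that gene is heavily represented on the chip. Below is a rough Python sketch of collapsing probe sets to unique LocusLink IDs before a hypergeometric test of the kind commonly used for category enrichment. It is not the GOstats implementation; the function names, probe set names, and LL IDs are hypothetical, and it does nothing about splice variants.

```python
from math import comb


def collapse_to_genes(probe_sets, probe_to_ll):
    """Map probe sets to the set of unique LocusLink IDs, dropping unmapped ones."""
    return {probe_to_ll[p] for p in probe_sets if p in probe_to_ll}


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when drawing n genes from N, of which K carry the GO term."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return hits / total


# Hypothetical mapping: two probe sets hit the same gene.
probe_to_ll = {"p1_at": "LL100", "p2_at": "LL100", "p3_at": "LL200"}
selected = collapse_to_genes(["p1_at", "p2_at", "p3_at"], probe_to_ll)
print(len(selected))  # 2 genes, not 3 probe sets

# 10 genes on the chip, 3 carry the term, 2 selected, at least 1 carries it.
p_value = hypergeom_upper_tail(1, 2, 3, 10)
print(p_value)
```

The counts passed to the test (k, n, K, N) should all be gene counts after collapsing, both for the selected list and for the chip as a whole.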

Thanks for the suggestions and the comments 

David