[BioC] GOstats question

Wed Mar 30 16:29:36 CEST 2005

On Mar 30, 2005, at 8:58 AM, Rickman David wrote:

> Hi Sean,
>
> What is indicated in the hgu133aACCNUM html for the hgu133a meta-data 
> package is: "For all the Affymetrix chips, the manufacturer/user 
> provided ids are GenBank accession numbers." So the starting material 
> for the pipeline here is GenBank acc #. It seems possible that with 
> this starting material one could potentially reduce the level of 
> ambiguity.
>
> As an example -- take the affy ids 207039_at and 211156_at (NM_000077 
> and AF115544, the respective GenBank IDs).  They correspond to locuslink 
> number 1029.  This number corresponds to 3 transcripts encoding 3 
> proteins (p12, p14 and p16).  GOA attributes same GO_ID 0016301 
> (kinase activity) for both p12 (NP_478104) and p14 (NP_478102) while 
> attributing 8 GO ids for p16 (NP_000068) (none of which are 0016301).  
> Entrez Gene associates AF115544 as the source sequence for NM_058197 
> (NP_478104).  NM_000077 corresponds to the variant NP_000068.  The 
> mapping by Dr. Gentleman et al yields the same 2 GO terms for both 
> probe sets (see example below).  The locuslink (GeneID) # 1029 should 
> yield
>
> Of course using the actual target sequence (which is given by affy) as 
> the starting material would help better to resolve variants as well as 
> permit a proper flagging of problem probe sets (see Mecham et al. 
> Physiol.Genom 2004 and Harbig et al NAR 2005) and ultimately map probe 
> sets to GOA.  But as you indicated, maybe Dr. Gentleman (or maybe 
> Chenwei Lin) could shed some light to why it is better to pass from 
> probe set/accession number provided by affy to locuslink to GO id to 
> study the potential enrichment of GO ids in an affy  microarray 
> experiment.
>

David,

I think Robert answered this indirectly today in another post.  The 
Bioconductor team maps based on ID matching in public databases.  In 
order to be general, I think the mapping from GenBank accession numbers 
to LocusLink (Entrez Gene) is done via UniGene: a GenBank accession 
number is looked up in the UniGene database and, if found, the 
associated LocusLink ID(s) are assigned to that probe set.  Then the 
information contained in LocusLink (GO, KEGG, etc.) is used to provide 
further annotation.  While for individual sequences (RefSeqs, in 
particular) it is possible to determine the Gene ID or RefSeq directly, 
this is not in general possible for GenBank accession numbers without 
going through UniGene (and even this isn't 100% foolproof).  Note that 
going through UniGene, which clusters sequences at the gene level, 
precludes any attempt to work at the transcript (or protein) level.
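
To make the chain of lookups concrete, here is a minimal sketch in Python 
(not the actual Bioconductor build code, which I have not reproduced). The 
probe sets, accessions, and LocusLink ID 1029 are taken from your example; 
the UniGene cluster ID and the GO labels are hypothetical placeholders, 
and the function name annotate() is made up for illustration.

```python
# Toy lookup tables. Probe sets, accessions, and LocusLink ID 1029 come from
# the example in this thread; the UniGene cluster ID and GO labels are
# HYPOTHETICAL placeholders, not real annotation.
PROBE_TO_ACCESSION = {"207039_at": "NM_000077", "211156_at": "AF115544"}
ACCESSION_TO_UNIGENE = {
    "NM_000077": "Hs.PLACEHOLDER",
    "AF115544": "Hs.PLACEHOLDER",
}
UNIGENE_TO_LOCUSLINK = {"Hs.PLACEHOLDER": [1029]}
LOCUSLINK_TO_GO = {1029: {"GO:term_A", "GO:term_B"}}


def annotate(probe_set):
    """Return GO terms inherited via accession -> UniGene -> LocusLink."""
    accession = PROBE_TO_ACCESSION.get(probe_set)
    cluster = ACCESSION_TO_UNIGENE.get(accession)
    if cluster is None:
        return set()  # accession not found in UniGene: no annotation
    go_terms = set()
    # Every LocusLink ID attached to the cluster contributes its GO terms.
    for locus in UNIGENE_TO_LOCUSLINK.get(cluster, []):
        go_terms |= LOCUSLINK_TO_GO.get(locus, set())
    return go_terms


for probe in ("207039_at", "211156_at"):
    print(probe, sorted(annotate(probe)))
```

Because both accessions fall into the same gene-level cluster, both probe 
sets come out with identical GO terms, whichever transcript each one 
actually targets.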

While there are other methods for annotating probe sets (see the 
articles you cite above), they all require aligning target or probe 
sequences (also available from Affy) to known entities (RefSeq, etc.).  
That is NOT what the Bioconductor team attempts to do, and it is a HUGE 
task to do well (I have done this for some long-oligo arrays).  You 
could do this yourself, if necessary.  Also, you could look at Ensembl, 
which does its own annotation of Affymetrix arrays.  The downside of 
doing these things yourself (or of not using the annotation packages 
provided by Bioconductor) is that you then need to either modify the 
nice functions from the Bioconductor project to use your own data, or 
make your data conform to the structures those functions expect (which, 
as you point out, will not suffice in this case).
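
As a rough illustration of the "make your data conform" route, the sketch 
below (Python, for illustration only) collapses a hypothetical table of 
alignment-derived rows into a probe-set-keyed lookup. The real annotation 
packages are R objects with their own layout, so treat this purely as an 
analogy. The probe sets and RefSeq IDs come from this thread; the GO 
labels and the helper name to_probe_keyed() are placeholders.

```python
from collections import defaultdict

# HYPOTHETICAL rows from a do-it-yourself, alignment-based annotation:
# (probe set, transcript the probes align to, GO term). GO labels are
# placeholders, not real annotation.
CUSTOM_ROWS = [
    ("207039_at", "NM_000077", "GO:term_X"),
    ("207039_at", "NM_000077", "GO:term_Y"),
    ("211156_at", "NM_058197", "GO:term_Z"),
]


def to_probe_keyed(rows):
    """Collapse per-transcript rows into a probe set -> sorted GO list."""
    table = defaultdict(set)
    for probe_set, _transcript, go_term in rows:
        table[probe_set].add(go_term)
    return {probe: sorted(terms) for probe, terms in table.items()}


PROBE_TO_GO = to_probe_keyed(CUSTOM_ROWS)
print(PROBE_TO_GO)
```

Note that the transcript column is dropped when the rows are collapsed, 
so a probe-keyed structure like this keeps transcript-specific GO terms 
but no record of which transcript they came from.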

Hope this helps.
Sean