[BioC] Odds Ratio in GOstat [resolved?]

Robert Gentleman rgentlem at fhcrc.org
Tue Dec 12 18:38:12 CET 2006


Hi,
   In principle (and I think in practice too) it is straightforward to 
modify GOstats (or any hypergeometric testing) to handle the situation 
where you believe that different ESTs represent different isoforms.

   Basically, you need to ensure that both the universe and the 
interesting gene list contain exactly one unique value for each entity 
(ESTs here) of interest. The standard mapping to GO terms is via 
EntrezGene IDs (AFAIK), so you cannot use those IDs as they are. You can, 
however, modify them so that each EST gets a unique name (while keeping 
the mapping to terms).
   E.g., if EG X had three ESTs on my array, I might rename them X_1, X_2 
and X_3, and make sure that all three are in my universe.
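
For illustration, the renaming described above can be sketched roughly as
follows. This is Python rather than R and does not use GOstats; the EST
names, the term labels TERM_A and TERM_B, and the helper functions
expand_est_ids and hypergeom_upper_tail are all made up for the example.
It only aims to show that once every EST has its own ID in both the
universe and the selected list, the usual hypergeometric upper-tail
calculation can be applied unchanged.

```python
from math import comb


def expand_est_ids(est_to_gene):
    """Give each EST a unique ID of the form <gene>_<n> (X_1, X_2, X_3, ...)."""
    counts = {}
    renamed = {}
    for est, gene in est_to_gene.items():
        counts[gene] = counts.get(gene, 0) + 1
        renamed[est] = f"{gene}_{counts[gene]}"
    return renamed


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n items are drawn without replacement from N,
    of which K carry the term of interest."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / total


# Hypothetical data: three ESTs hit gene X, one hits Y, one hits Z.
est_to_gene = {"est_a": "X", "est_b": "X", "est_c": "X", "est_d": "Y", "est_e": "Z"}
# Hypothetical gene-to-term mapping (labels are placeholders, not real GO IDs).
gene_to_terms = {"X": {"TERM_A"}, "Y": {"TERM_A"}, "Z": {"TERM_B"}}
# Hypothetical list of interesting (e.g. upregulated) ESTs.
selected_ests = ["est_a", "est_b", "est_d"]

renamed = expand_est_ids(est_to_gene)
universe = sorted(renamed.values())
# Every entity must appear exactly once in the universe.
assert len(universe) == len(set(universe))

# Each renamed ID keeps the terms of the gene it was derived from.
id_to_terms = {renamed[est]: gene_to_terms[gene] for est, gene in est_to_gene.items()}
selected = [renamed[est] for est in selected_ests]

term = "TERM_A"
N = len(universe)                                   # universe size
K = sum(term in id_to_terms[i] for i in universe)   # universe IDs with the term
n = len(selected)                                   # selected IDs
k = sum(term in id_to_terms[i] for i in selected)   # selected IDs with the term
p_value = hypergeom_upper_tail(k, K, n, N)
print(f"{term}: {k} of {n} selected, {K} of {N} in universe, p = {p_value:.3f}")
```

Counting per EST rather than per gene is the assumption being illustrated
here; collapsing to unique gene IDs instead would change N, K, n and k.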

   But I guess that if I thought sequence was really that important, I 
would look at some sort of grouping other than GO.  I don't know, for 
example, how well homology would work, and I suspect that no one has done 
a comparative study. I would also worry about ISS (inferred from sequence 
or structural similarity) annotations, in addition to IEA (inferred from 
electronic annotation) ones.

   best wishes
     Robert

Björn Usadel wrote:
> Dear Naomi,
> 
> 
> If I understand you correctly, your problem seems to be that you 
> investigate the classifications of the best hits in the sequenced 
> organism and not the classes of your actual ESTs.
> 
> In this case, the route I usually take is to transfer the ontological 
> terms onto the ESTs (or, better, unigenes) and use these for testing. (I 
> use neither GO nor GOstats, though.)
> From a biological point of view, I think this also makes sense. Just 
> assume your sequenced species has one isoform of a particular enzyme 
> (B), which has already expanded to two isoforms (B1 and B2) in your 
> non-sequenced organism, and these are not yet completely 
> subfunctionalized etc. In this case your non-sequenced organism really 
> has two times GO:molecular_function:whatever.
> Also, I am more interested in the distribution of genes in the organism 
> I am looking at than in an already sequenced one. As an extreme case, if 
> you inferred GO terms by blasting plants against vertebrates, you would 
> run into the problem of the super-expanded gene families in plants (which 
> are real).
> 
> So to answer your question I would say 3 out of 5.
> 
> However, it is not trivial to transfer ontological terms, especially if 
> the originals were already "inferred from electronic annotation". Also, 
> if you are not sure about the sequence clustering process (e.g. ESTs B1 
> and B2 should really represent one unigene), things start getting shaky.
> But there are annotation packages like Interpro2GO, blast2go, you 
> name it.
> So to sum this up, I think you should rely on good old sequence-based 
> bioinformatics.
> 
> Just my 5 cents though....
> 
> Cheers,
> Björn
> 
> Naomi Altman wrote:
>> The duplicate genes problem is an interesting one.  The selected gene 
>> list includes duplicates because it comes from blasting an EST set 
>> from an unsequenced species against a sequenced species.  Each 
>> duplicated ID is the nearest homolog of more than one EST, but those 
>> ESTs are supposed to represent multiple genes.  How to handle this for 
>> GO enrichment is an interesting question.
>>
>> e.g.  The annotation has genes A, B and C.
>> We observe that matches A1, A2 and B1 are upregulated, but B2 and C 
>> are not.  Should we say that 3 out of 5 are upregulated, or 2 out of 3?
>>
>> --Naomi
>>
>> At 07:43 PM 12/11/2006, Seth Falcon wrote:
>>> The selected gene list contained duplicate ids.  I'm pretty sure this
>>> is the problem.  The Category + GOstats code should detect such input
>>> errors and give a sensible error message.  I will add such checking
>>> very soon.
>>>
>>> + seth
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: 
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> Naomi S. Altman                                814-865-3791 (voice)
>> Associate Professor
>> Dept. of Statistics                              814-863-7114 (fax)
>> Penn State University                         814-865-1348 (Statistics)
>> University Park, PA 16802-2111
>>
> 

-- 
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org


