[BioC] understanding GOstats p-value

Janet Young jayoung at fhcrc.org
Mon Jan 7 20:45:20 CET 2008

Thanks James and Kevin - that has made things clearer for me.

We are also dealing with a third kind of non-independence in our data  
- array CGH analysis detects large genomic regions of change, and  
genes of similar function (e.g. large gene families like olfactory  
receptors) can be clustered in the genome.

Because of this, we'd planned to do something similar to your
resampling suggestion - simulate multiple sets of genomic regions with
the same size distribution as the real data, determine their gene
content, and run the GOstats analysis on each simulated set.
From what you say, this seems a reasonable approach, although as you
point out, it's time-consuming. I'm already running into problems
with how long it takes, so I may try distributing the runs over
multiple Linux cluster nodes, if I can make that happen relatively
easily.
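
A rough sketch of the region-based resampling described above, written
in Python rather than R so that it is self-contained. Everything here
is an assumption for illustration: the function names are hypothetical,
the genome is treated as one linear coordinate instead of separate
chromosomes, and each category is scored with a plain one-sided
hypergeometric test, not the GOstats conditional test. It only shows
the shape of the procedure: place regions of the observed sizes at
random, collect the genes they cover, and count how many categories
pass the cutoff in each simulated set.

```python
import bisect
import random
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when n genes are drawn from a universe of N genes,
    K of which belong to the category."""
    top = min(K, n)
    numerator = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, top + 1))
    return numerator / comb(N, n)


def genes_in_random_regions(positions, gene_ids, region_lengths,
                            genome_length, rng):
    """Place each region at a uniformly random start and return the
    set of genes whose position falls inside any region."""
    selected = set()
    for length in region_lengths:
        start = rng.uniform(0, genome_length - length)
        lo = bisect.bisect_left(positions, start)
        hi = bisect.bisect_right(positions, start + length)
        selected.update(gene_ids[lo:hi])
    return selected


def simulate_null_counts(genes, categories, region_lengths, genome_length,
                         n_sim=100, cutoff=0.001, seed=0):
    """Return, for each simulated set of regions, the number of
    categories with p < cutoff.

    genes: list of (gene_id, position) pairs forming the universe.
    categories: dict mapping category name to a set of gene ids
                (assumed to be drawn from the same universe).
    region_lengths: sizes of the regions seen in the real data.
    """
    rng = random.Random(seed)
    genes = sorted(genes, key=lambda g: g[1])  # sort by position
    gene_ids = [g[0] for g in genes]
    positions = [g[1] for g in genes]
    N = len(genes)

    counts = []
    for _ in range(n_sim):
        selected = genes_in_random_regions(positions, gene_ids,
                                           region_lengths, genome_length, rng)
        n = len(selected)
        hits = 0
        for members in categories.values():
            k = len(members & selected)
            # k == 0 always gives p = 1, so skip the computation.
            if k > 0 and hypergeom_upper_tail(k, N, len(members), n) < cutoff:
                hits += 1
        counts.append(hits)
    return counts
```

The mean of the returned counts is an estimate of how many categories
would pass the cutoff by chance, given the clustering of genes in the
genome; that can be compared with the number seen in the real data.
Each simulation is independent, so the loop could be split across
cluster nodes by giving each node its own seed.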

On Jan 7, 2008, at 6:40 AM, Kevin R. Coombes wrote:

> All this is certainly true.  However, it is not clear that the  
> dependence makes any real qualitative difference in the results you  
> get.  See, for example
> Gold et al, "Enrichment analysis in high-throughput genomics -  
> accounting for dependency in the NULL", Brief Bioinform 2007; 8:71-77
> where we explicitly worked out the implications (for the  
> distribution) of the dependence between pairs of GO categories and  
> checked some actual data sets to see how much things changed.
> 	kevin
> James MacDonald wrote:
>> Hi Janet,
>> Interpreting p-values for the hypergeometric test is not  
>> straightforward. One of the underlying assumptions of the  
>> hypergeometric is that the individual things being chosen are  
>> independent (think balls in an urn). Unfortunately, this is not  
>> true of genes or GO terms.
>> There are at least two types of dependence here. First, the  
>> expression of genes is not independent -- one gene can affect the  
>> expression of another. Second, the GO terms are set up as a  
>> directed acyclic graph, with child terms being subsets of the  
>> parent terms, so there is another level of dependence. You can use  
>> the conditional test to help limit this second level of  
>> dependence, but there isn't too much you can do about the first.
>> Because of this unknown dependence structure it is difficult to do  
>> any multiple testing correction for the hypergeometric for a  
>> single comparison, not to mention multiple comparisons. One thing  
>> I have done in the past for a single comparison is a Monte Carlo
>> resampling in which you randomly select n 'differentially
>> expressed' genes (where n is the number of differentially
>> expressed genes you actually observed) and then see how many
>> significant GO terms you get. Do this, say, 500 or 1000 times, and
>> you will know how many terms you expect to see by chance alone,  
>> which gives you an estimate of the number of false positives in  
>> your observed results. Unfortunately, this is very time consuming,  
>> and I'm not sure if you could scale to multiple comparisons.
>> However, if you just have a small number of terms significant, it  
>> shouldn't be too difficult to do downstream validation of that  
>> result.
>> Best,
>> Jim
>> Janet Young wrote:
>>> Hi,
>>> I have a fairly naive question - I want to make sure I can more
>>> or less understand the p-values that GOstats hyperGTest comes
>>> out with. Am I right in thinking the p-value is for enrichment
>>> of each category individually (i.e. NOT corrected for multiple
>>> testing)?
>>> I'm analyzing array CGH data so I am testing a lot of categories
>>> (my universe is all human genes that have a chromosome position,
>>> GO category and entrez ID). Below is an example result - my
>>> interpretation is that I shouldn't get super-excited about
>>> finding 3 categories with p<0.001 if I've tested 2261 categories
>>> (would expect about 2 false positives). Have I understood that
>>> correctly?
>>> > hgCondOver
>>> Gene to GO BP Conditional test for over-representation
>>> 2261 GO BP ids tested (3 have p < 0.001)
>>> Selected gene set size: 1433
>>>      Gene universe size: 12325
>>>      Annotation package: org.Hs.eg.db
>>> > summary(hgCondOver)
>>>                 GOBPID       Pvalue OddsRatio  ExpCount Count Size
>>> GO:0007156 GO:0007156 0.0001330755  2.470839 12.905720    27  111
>>> GO:0001894 GO:0001894 0.0007587546  5.553301  2.209087     8   19
>>> GO:0007600 GO:0007600 0.0009353695  1.446591 74.062556   100  637
>>>                                 Term
>>> GO:0007156 homophilic cell adhesion
>>> GO:0001894       tissue homeostasis
>>> GO:0007600       sensory perception
>>> thanks very much,
>>> Janet Young
>>> -------------------------------------------------------------------
>>> Dr. Janet Young (Trask lab)
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Avenue N., C3-168,
>>> P.O. Box 19024, Seattle, WA 98109-1024, USA.
>>> tel: (206) 667 1471 fax: (206) 667 6524
>>> email: jayoung at fhcrc.org
>>> http://www.fhcrc.org/labs/trask/