[BioC] understanding GOstats p-value

James MacDonald jmacdon at med.umich.edu
Sun Jan 6 02:57:12 CET 2008



James MacDonald wrote:
> Hi Janet,
> 
> Interpreting p-values for the hypergeometric test is not 
> straightforward. One of the underlying assumptions of the hypergeometric 
> test is that the individual things being chosen are independent (think 
> balls in an urn). Unfortunately, this is not true of genes or GO terms.
> 
> There are at least two types of dependence here. First, the expression 
> of genes is not independent -- one gene can affect the expression of 
> another. Second, the GO terms are set up as a directed acyclic graph, 
> with child terms being subsets of the parent terms, so there is another 
> level of dependence. You can use the conditional test to help limit this 
> second level of dependence, but there isn't too much you can do about 
> the first.
> 
> Because of this unknown dependence structure it is difficult to do any 
> multiple testing correction for the hypergeometric test for a single 
> comparison, not to mention multiple comparisons. One thing I have done 
> in the past for a single comparison is a Monte Carlo resampling in 
> which you randomly select n 'differentially expressed' genes (where n is 
> the number of differentially expressed genes that you actually 
> observed) and then see how many significant GO terms you get. Do this, 
> say, 500 or 1000 times, and you will know how many terms you expect to 
> see by chance alone, which gives you an estimate of the number of false 
> positives in your observed results. Unfortunately, this is very time 
> consuming, and I'm not sure whether it would scale to multiple comparisons.

And I should note that this _still_ doesn't take the inter-gene 
dependence into account.
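
For anyone who wants to see the shape of that resampling, here is a minimal
sketch in plain Python (standard library only), not GOstats code. It uses the
ordinary, unconditional hypergeometric upper tail rather than the conditional
test, and the names hypergeom_upper_tail, count_significant,
null_significant_counts and the toy term_to_genes mapping are all made up for
illustration. Treat it as an outline of the idea under those assumptions.

```python
import random
from math import comb

def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K carry the term."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

def count_significant(selected, universe_size, term_to_genes, alpha):
    """Number of terms whose over-representation p-value falls below alpha."""
    n = len(selected)
    hits = 0
    for genes in term_to_genes.values():
        k = len(genes & selected)
        if k > 0 and hypergeom_upper_tail(k, universe_size, len(genes), n) < alpha:
            hits += 1
    return hits

def null_significant_counts(universe, term_to_genes, n_selected,
                            alpha=0.001, n_iter=500, seed=0):
    """Draw n_selected genes at random n_iter times; record significant terms."""
    rng = random.Random(seed)
    pool = sorted(universe)
    counts = []
    for _ in range(n_iter):
        fake_selection = set(rng.sample(pool, n_selected))
        counts.append(count_significant(fake_selection, len(pool),
                                        term_to_genes, alpha))
    return counts

# Toy data (hypothetical): 200 genes, 30 terms of random size.
toy_rng = random.Random(1)
toy_universe = set(range(200))
toy_terms = {f"term{i}": set(toy_rng.sample(sorted(toy_universe),
                                            toy_rng.randint(5, 40)))
             for i in range(30)}

null_counts = null_significant_counts(toy_universe, toy_terms,
                                      n_selected=25, alpha=0.05, n_iter=200)
print("mean significant terms by chance:", sum(null_counts) / len(null_counts))
```

The mean of null_counts is the number of terms you would expect to call
significant by chance for a selection of that size, which you can compare
with the number seen for the real gene list. Because the random selections
are drawn gene by gene, this inherits the limitation noted above: it says
nothing about dependence between genes.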


> 
> However, if you just have a small number of terms significant, it 
> shouldn't be too difficult to do downstream validation of that result.
> 
> Best,
> 
> Jim
> 
> 
> Janet Young wrote:
>> Hi,
>>
>> I have a fairly naive question - I want to make sure I can more or  
>> less understand the p-values that GOstats hyperGTest comes out with.   
>> Am I right in thinking the p-value is for enrichment of each category  
>> individually (i.e. NOT corrected for multiple testing)?
>>
>> I'm analyzing array CGH data so I am testing a lot of categories (my  
>> universe is all human genes that have a chromosome position, GO  
>> category and entrez ID).  Below is an example result - my  
>> interpretation is that I shouldn't get super-excited about finding 3  
>> categories with p<0.001 if I've tested 2261 categories (would expect  
>> about 2 false positives).   Have I understood that correctly?
>>
>>  > hgCondOver
>> Gene to GO BP Conditional test for over-representation
>> 2261 GO BP ids tested (3 have p < 0.001)
>> Selected gene set size: 1433
>>      Gene universe size: 12325
>>      Annotation package: org.Hs.eg.db
>>  >  summary(hgCondOver)
>>                 GOBPID       Pvalue OddsRatio  ExpCount Count Size
>> GO:0007156 GO:0007156 0.0001330755  2.470839 12.905720    27  111
>> GO:0001894 GO:0001894 0.0007587546  5.553301  2.209087     8   19
>> GO:0007600 GO:0007600 0.0009353695  1.446591 74.062556   100  637
>>                                 Term
>> GO:0007156 homophilic cell adhesion
>> GO:0001894       tissue homeostasis
>> GO:0007600       sensory perception
>>
>> thanks very much,
>>
>> Janet Young
>>
>> -------------------------------------------------------------------
>>
>> Dr. Janet Young (Trask lab)
>>
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Avenue N., C3-168,
>> P.O. Box 19024, Seattle, WA 98109-1024, USA.
>>
>> tel: (206) 667 1471 fax: (206) 667 6524
>> email: jayoung at fhcrc.org
>>
>> http://www.fhcrc.org/labs/trask/
>>
> 
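
As a quick check of the arithmetic in the quoted question, the short Python
sketch below reproduces two numbers from it: the roughly 2 false positives
expected from 2261 tests at p < 0.001, and the ExpCount column of the summary
table. The function names are invented for illustration. The false positive
figure assumes every null hypothesis is true and that the p-values are
uniform, so it is a rough guide at best. The ExpCount formula (Size times
selected set size over universe size) is the usual hypergeometric mean, and
it happens to match the values reported in this output.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of tests below alpha if every null hypothesis is true."""
    return n_tests * alpha

def expected_count(term_size, selected_size, universe_size):
    """Hypergeometric mean: annotated genes expected in the selected set."""
    return term_size * selected_size / universe_size

SELECTED = 1433    # "Selected gene set size" from the quoted output
UNIVERSE = 12325   # "Gene universe size" from the quoted output

print(expected_false_positives(2261, 0.001))
for go_id, size in [("GO:0007156", 111), ("GO:0001894", 19), ("GO:0007600", 637)]:
    print(go_id, round(expected_count(size, SELECTED, UNIVERSE), 6))
```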

-- 
James W. MacDonald, MS
Biostatistician
UMCCC cDNA and Affymetrix Core
University of Michigan
1500 E Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623


