[BioC] understanding GOstats p-value
James MacDonald
jmacdon at med.umich.edu
Sun Jan 6 00:34:12 CET 2008
Hi Janet,
Interpreting p-values from the hypergeometric test is not
straightforward. One of the underlying assumptions of the test is that
the individual things being chosen are independent (think balls in an
urn). Unfortunately, this is not true of genes or GO terms.
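To make the urn picture concrete, here is a minimal sketch in plain
Python (not GOstats itself) of the ordinary, non-conditional
hypergeometric tail probability, using the GO:0001894 row of the output
quoted below. The function name is made up for illustration, and because
the conditional test adjusts the counts, the p-value it gives need not
agree exactly with the one GOstats reports.

```python
from math import comb


def hypergeom_upper_tail(count, term_size, selected, universe):
    """Return P(X >= count) for a plain hypergeometric draw.

    `selected` genes are drawn without replacement from `universe` genes,
    of which `term_size` are annotated to the GO term.
    """
    total = comb(universe, selected)
    upper = min(term_size, selected)
    favourable = sum(
        comb(term_size, k) * comb(universe - term_size, selected - k)
        for k in range(count, upper + 1)
    )
    # Exact big-integer ratio, converted to a float only at the end.
    return favourable / total


# Numbers taken from the GO:0001894 row of the quoted summary.
universe, selected = 12325, 1433
term_size, count = 19, 8

expected = selected * term_size / universe  # matches the ExpCount column
p_value = hypergeom_upper_tail(count, term_size, selected, universe)
print(f"expected count: {expected:.6f}")
print(f"plain hypergeometric p-value: {p_value:.3g}")
```

The calculation treats every gene as an independent ball in the urn,
which is exactly the assumption that does not hold here.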
There are at least two types of dependence here. First, the expression
of genes is not independent -- one gene can affect the expression of
another. Second, the GO terms are set up as a directed acyclic graph,
with child terms being subsets of the parent terms, so there is another
level of dependence. You can use the conditional test to help limit this
second level of dependence, but there isn't too much you can do about
the first.
Because of this unknown dependence structure, it is difficult to do any
multiple testing correction for the hypergeometric test for a single
comparison, not to mention multiple comparisons. One thing I have done
in the past for a single comparison is Monte Carlo resampling: randomly
select n 'differentially expressed' genes (where n is the number of
differentially expressed genes you actually observed) and then see how
many significant GO terms you get. Do this, say, 500 or 1000 times, and
you will know how many terms you expect to see by chance alone, which
gives you an estimate of the number of false positives in your observed
results. Unfortunately, this is very time consuming, and I'm not sure
whether it would scale to multiple comparisons.
However, if only a small number of terms are significant, it
shouldn't be too difficult to do downstream validation of that result.
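The resampling idea can be sketched roughly as follows. This is a toy
Python illustration under stated assumptions, not what GOstats does: the
term-to-gene mapping is random made-up data, the test is the plain
(non-conditional) hypergeometric, and all function names are
hypothetical. In a real analysis you would rerun hyperGTest on each
random gene list instead.

```python
import random
from math import comb


def upper_tail(count, term_size, selected, universe):
    """P(X >= count) for a plain hypergeometric draw."""
    total = comb(universe, selected)
    upper = min(term_size, selected)
    favourable = sum(
        comb(term_size, k) * comb(universe - term_size, selected - k)
        for k in range(count, upper + 1)
    )
    return favourable / total


def count_significant(selected_genes, term_to_genes, universe_size, cutoff):
    """Number of terms whose over-representation p-value is below cutoff."""
    n = len(selected_genes)
    hits = 0
    for genes in term_to_genes.values():
        count = len(genes & selected_genes)
        # A term with no selected genes has p = 1, so it can be skipped.
        if count and upper_tail(count, len(genes), n, universe_size) < cutoff:
            hits += 1
    return hits


def expected_by_chance(universe, term_to_genes, n_selected,
                       cutoff=0.001, n_iter=500, seed=0):
    """Draw n_selected random genes n_iter times and count significant terms.

    Returns the mean count (an estimate of the number of false positives)
    together with the per-iteration counts.
    """
    rng = random.Random(seed)
    genes = sorted(universe)
    counts = []
    for _ in range(n_iter):
        fake_hits = set(rng.sample(genes, n_selected))
        counts.append(
            count_significant(fake_hits, term_to_genes, len(genes), cutoff)
        )
    return sum(counts) / n_iter, counts


if __name__ == "__main__":
    # Made-up data: 500 genes, 40 terms of random size.
    rng = random.Random(1)
    universe = range(500)
    terms = {
        f"term{i}": set(rng.sample(universe, rng.randint(5, 40)))
        for i in range(40)
    }
    mean, _ = expected_by_chance(universe, terms, n_selected=60,
                                 cutoff=0.01, n_iter=200)
    print(f"mean number of terms significant by chance: {mean:.2f}")
```

Because each iteration draws from the real annotation, the overlap
between terms is carried into the null distribution, which is what a
simple "number of tests times cutoff" calculation cannot do. Drawing
genes at random does not, however, reproduce the dependence between the
genes themselves.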
Best,
Jim
Janet Young wrote:
> Hi,
>
> I have a fairly naive question - I want to make sure I can more or
> less understand the p-values that GOstats hyperGTest comes out with.
> Am I right in thinking the p-value is for enrichment of each category
> individually (i.e. NOT corrected for multiple testing)?
>
> I'm analyzing array CGH data so I am testing a lot of categories (my
> universe is all human genes that have a chromosome position, GO
> category and entrez ID). Below is an example result - my
> interpretation is that I shouldn't get super-excited about finding 3
> categories with p<0.001 if I've tested 2261 categories (would expect
> about 2 false positives). Have I understood that correctly?
>
> > hgCondOver
> Gene to GO BP Conditional test for over-representation
> 2261 GO BP ids tested (3 have p < 0.001)
> Selected gene set size: 1433
> Gene universe size: 12325
> Annotation package: org.Hs.eg.db
> > summary(hgCondOver)
>                GOBPID       Pvalue OddsRatio  ExpCount Count Size
> GO:0007156 GO:0007156 0.0001330755  2.470839 12.905720    27  111
> GO:0001894 GO:0001894 0.0007587546  5.553301  2.209087     8   19
> GO:0007600 GO:0007600 0.0009353695  1.446591 74.062556   100  637
>                                Term
> GO:0007156 homophilic cell adhesion
> GO:0001894       tissue homeostasis
> GO:0007600       sensory perception
>
> thanks very much,
>
> Janet Young
>
> -------------------------------------------------------------------
>
> Dr. Janet Young (Trask lab)
>
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Avenue N., C3-168,
> P.O. Box 19024, Seattle, WA 98109-1024, USA.
>
> tel: (206) 667 1471 fax: (206) 667 6524
> email: jayoung at fhcrc.org
>
> http://www.fhcrc.org/labs/trask/
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
James W. MacDonald, MS
Biostatistician
UMCCC cDNA and Affymetrix Core
University of Michigan
1500 E Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623