[BioC] understanding GOstats p-value
jmacdon at med.umich.edu
Sun Jan 6 02:57:12 CET 2008
James MacDonald wrote:
> Hi Janet,
> Interpreting p-values for the hypergeometric test is not
> straightforward. One of the underlying assumptions of the hypergeometric
> test is that the individual things being chosen are independent (think
> balls in an urn). Unfortunately, this is not true of genes or GO terms.
> There are at least two types of dependence here. First, the expression
> of genes is not independent -- one gene can affect the expression of
> another. Second, the GO terms are set up as a directed acyclic graph,
> with child terms being subsets of the parent terms, so there is another
> level of dependence. You can use the conditional test to help limit this
> second level of dependence, but there isn't too much you can do about
> the first.
> Because of this unknown dependence structure it is difficult to do any
> multiple testing correction for the hypergeometric test for a single
> comparison, not to mention multiple comparisons. One thing I have done
> in the past for a single comparison is Monte Carlo resampling:
> randomly select n 'differentially expressed' genes (where n is
> the number of differentially expressed genes that you actually
> observed) and then see how many significant GO terms you get. Do this,
> say, 500 or 1000 times, and you will know how many terms you expect to
> see by chance alone, which gives you an estimate of the number of false
> positives in your observed results. Unfortunately, this is very
> time-consuming, and I'm not sure it would scale to multiple comparisons.
And I should note that this _still_ doesn't take the inter-gene
dependence into account.
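
A minimal sketch of the resampling described above, in base R. Everything in it is an assumption made for illustration: the toy `universe`, the random `terms` list, and the helper names `hyper_p` and `null_counts` are hypothetical, and a plain `phyper()` test stands in for the conditional `hyperGTest` call, which you would substitute in a real analysis.

```r
# Toy data: all names and sizes here are hypothetical, not from the thread.
set.seed(1)
universe <- paste0("g", 1:2000)
# Each term is a random gene set; real GO terms are nested, which this ignores.
terms <- lapply(1:200, function(i) sample(universe, sample(10:100, 1)))

# Plain (unconditional) hypergeometric p-value for every term.
hyper_p <- function(selected, terms, universe) {
  N <- length(universe)   # universe size
  k <- length(selected)   # number of selected genes
  vapply(terms, function(members) {
    x <- sum(members %in% selected)   # selected genes annotated to the term
    m <- length(members)              # term size
    phyper(x - 1, m, N - m, k, lower.tail = FALSE)   # P(X >= x)
  }, numeric(1))
}

# For B random gene lists of size n, count the terms below the cutoff.
null_counts <- function(n, B, cutoff = 0.001) {
  replicate(B, {
    fake <- sample(universe, n)   # random 'differentially expressed' genes
    sum(hyper_p(fake, terms, universe) < cutoff)
  })
}

counts <- null_counts(n = 250, B = 200)
mean(counts)   # average number of 'significant' terms seen by chance alone
```

With the real test plugged in, `mean(counts)` would be compared against the number of terms observed below the same cutoff. Like the procedure it sketches, this draws genes independently and so does not capture the inter-gene dependence.
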
> However, if you just have a small number of terms significant, it
> shouldn't be too difficult to do downstream validation of that result.
> Janet Young wrote:
>> I have a fairly naive question - I want to make sure I can more or
>> less understand the p-values that GOstats hyperGTest comes out with.
>> Am I right in thinking the p-value is for enrichment of each category
>> individually (i.e. NOT corrected for multiple testing)?
>> I'm analyzing array CGH data so I am testing a lot of categories (my
>> universe is all human genes that have a chromosome position, GO
>> category and entrez ID). Below is an example result - my
>> interpretation is that I shouldn't get super-excited about finding 3
>> categories with p<0.001 if I've tested 2261 categories (would expect
>> about 2 false positives). Have I understood that correctly?
>> > hgCondOver
>> Gene to GO BP Conditional test for over-representation
>> 2261 GO BP ids tested (3 have p < 0.001)
>> Selected gene set size: 1433
>> Gene universe size: 12325
>> Annotation package: org.Hs.eg.db
>> > summary(hgCondOver)
>>                GOBPID       Pvalue OddsRatio  ExpCount Count Size                     Term
>> GO:0007156 GO:0007156 0.0001330755  2.470839 12.905720    27  111 homophilic cell adhesion
>> GO:0001894 GO:0001894 0.0007587546  5.553301  2.209087     8   19       tissue homeostasis
>> GO:0007600 GO:0007600 0.0009353695  1.446591 74.062556   100  637       sensory perception
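
The arithmetic behind the question can be checked directly in R. This is a small sketch using only the numbers printed in the output above. The variable names are hypothetical, and reading `ExpCount` as term size times the selected fraction of the universe is an assumption that happens to reproduce the printed values.

```r
# Expected number of p < 0.001 results if all 2261 tests were independent nulls.
n_tests <- 2261
cutoff  <- 0.001
n_tests * cutoff   # 2.261, i.e. "about 2 false positives"

# Expected count per term: term size times the fraction of the universe selected.
selected <- 1433
universe <- 12325
size <- c("GO:0007156" = 111, "GO:0001894" = 19, "GO:0007600" = 637)
exp_count <- size * selected / universe
round(exp_count, 6)   # compare with the ExpCount column
```

The first figure assumes independent tests, which, as discussed in the reply, does not hold for GO terms, so treat it as a rough guide only.
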
>> thanks very much,
>> Janet Young
>> Dr. Janet Young (Trask lab)
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Avenue N., C3-168,
>> P.O. Box 19024, Seattle, WA 98109-1024, USA.
>> tel: (206) 667 1471 fax: (206) 667 6524
>> email: jayoung at fhcrc.org
James W. MacDonald, MS
UMCCC cDNA and Affymetrix Core
University of Michigan
1500 E Medical Center Drive
Ann Arbor MI 48109