[BioC] understanding GOstats p-value
Kevin R. Coombes
krcoombes at mdacc.tmc.edu
Mon Jan 7 15:40:33 CET 2008
All this is certainly true. However, it is not clear that the
dependence makes any real qualitative difference in the results you get.
See, for example,
Gold et al., "Enrichment analysis in high-throughput genomics -
accounting for dependency in the NULL", Brief Bioinform 2007; 8:71-77,
where we explicitly worked out the implications (for the distribution)
of the dependence between pairs of GO categories and checked some actual
data sets to see how much things changed.
James MacDonald wrote:
> Hi Janet,
> Interpreting p-values for the hypergeometric test is not
> straightforward. One of the underlying assumptions of the hypergeometric
> is that the individual things being chosen are independent (think balls
> in an urn). Unfortunately, this is not true of genes or GO terms.
> There are at least two types of dependence here. First, the expression
> of genes is not independent -- one gene can affect the expression of
> another. Second, the GO terms are set up as a directed acyclic graph,
> with child terms being subsets of the parent terms, so there is another
> level of dependence. You can use the conditional test to help limit this
> second level of dependence, but there isn't too much you can do about
> the first.
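
To make the urn picture concrete, here is a minimal sketch in base R of the unconditional over-representation p-value for a single category. The counts are copied from the GO:0007156 row of the summary quoted further down; the variable names are mine, and treating that row as a plain (non-conditional) test is an assumption, since the conditional test can remove genes from a term before testing it.

```r
# Counts taken from the quoted hgCondOver summary (GO:0007156 row).
universe_size <- 12325  # genes in the universe (balls in the urn)
selected_size <- 1433   # genes in the selected set (balls drawn)
category_size <- 111    # universe genes annotated to the term ("white" balls)
observed      <- 27     # selected genes annotated to the term

# Expected overlap if the selected genes were drawn independently at random.
expected <- selected_size * category_size / universe_size

# P(overlap >= observed): upper tail of the hypergeometric distribution.
p_over <- phyper(observed - 1,
                 category_size,
                 universe_size - category_size,
                 selected_size,
                 lower.tail = FALSE)

expected
p_over
```

The expected count agrees with the ExpCount column for that row. The p-value is only valid to the extent that the independent-draws assumption holds, which is exactly the point under discussion.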
> Because of this unknown dependence structure it is difficult to do any
> multiple testing correction for the hypergeometric for a single
> comparison, not to mention multiple comparisons. One thing I have done
> in the past for a single comparison is a Monte Carlo resampling in
> which you randomly select n 'differentially expressed' genes (where n is
> the number of differentially expressed genes that you actually
> observed) and then see how many significant GO terms you get. Do this,
> say, 500 or 1000 times, and you will know how many terms you expect to
> see by chance alone, which gives you an estimate of the number of false
> positives in your observed results. Unfortunately, this is very
> time-consuming, and I'm not sure it would scale to multiple comparisons.
> However, if you just have a small number of terms significant, it
> shouldn't be too difficult to do downstream validation of that result.
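
The resampling idea described above can be sketched in base R roughly as follows. Everything here is a toy: the gene universe, the 50 random categories, and the helper `hyper_p()` are hypothetical stand-ins, and a plain hypergeometric test is used in place of the GOstats conditional test. With real data you would substitute your actual annotation and rerun the real test on each random gene set, which is where the run time goes.

```r
set.seed(1)

# Toy universe and toy categories (hypothetical; not real GO annotation).
n_genes    <- 2000
universe   <- paste0("g", seq_len(n_genes))
categories <- lapply(1:50, function(i) sample(universe, sample(10:200, 1)))

# Unconditional hypergeometric p-value for every category.
hyper_p <- function(selected, categories, universe) {
  vapply(categories, function(members) {
    k <- length(intersect(members, selected))
    phyper(k - 1, length(members), length(universe) - length(members),
           length(selected), lower.tail = FALSE)
  }, numeric(1))
}

n_selected <- 200    # n: number of 'differentially expressed' genes observed
n_iter     <- 500    # number of random gene sets to draw
cutoff     <- 0.001  # significance threshold applied to each category

# For each random gene set, count how many categories look significant.
null_counts <- replicate(n_iter, {
  random_set <- sample(universe, n_selected)
  sum(hyper_p(random_set, categories, universe) < cutoff)
})

# Average number of 'significant' categories expected by chance alone.
mean(null_counts)
```

Because the toy categories are generated independently of one another, this sketch does not reproduce the dependence among real GO terms; it only shows the mechanics of the resampling loop.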
> Janet Young wrote:
>> I have a fairly naive question - I want to make sure I can more or
>> less understand the p-values that GOstats hyperGTest comes out with.
>> Am I right in thinking the p-value is for enrichment of each category
>> individually (i.e. NOT corrected for multiple testing)?
>> I'm analyzing array CGH data so I am testing a lot of categories (my
>> universe is all human genes that have a chromosome position, GO
>> category and entrez ID). Below is an example result - my
>> interpretation is that I shouldn't get super-excited about finding 3
>> categories with p<0.001 if I've tested 2261 categories (would expect
>> about 2 false positives). Have I understood that correctly?
>> > hgCondOver
>> Gene to GO BP Conditional test for over-representation
>> 2261 GO BP ids tested (3 have p < 0.001)
>> Selected gene set size: 1433
>> Gene universe size: 12325
>> Annotation package: org.Hs.eg.db
>> > summary(hgCondOver)
>>                GOBPID       Pvalue OddsRatio  ExpCount Count Size
>> GO:0007156 GO:0007156 0.0001330755  2.470839 12.905720    27  111
>> GO:0001894 GO:0001894 0.0007587546  5.553301  2.209087     8   19
>> GO:0007600 GO:0007600 0.0009353695  1.446591 74.062556   100  637
>>                                Term
>> GO:0007156 homophilic cell adhesion
>> GO:0001894       tissue homeostasis
>> GO:0007600       sensory perception
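
The "about 2 false positives" estimate, and what a standard correction would do to these three p-values, can be checked with a few lines of base R. This is a rough sketch that assumes every null hypothesis is true, that the p-values are approximately uniform, and that the tests are independent; as discussed in this thread, the last assumption does not hold for GO terms. The variable names are mine.

```r
n_tests <- 2261   # GO BP ids tested, from the quoted output
cutoff  <- 0.001

# Expected number of categories below the cutoff by chance alone.
expected_fp <- n_tests * cutoff

# The three reported p-values, the smallest of the 2261.
pvals <- c("GO:0007156" = 0.0001330755,
           "GO:0001894" = 0.0007587546,
           "GO:0007600" = 0.0009353695)

# Bonferroni adjustment over all tests performed.
bonf <- p.adjust(pvals, method = "bonferroni", n = n_tests)

# Benjamini-Hochberg adjustment. Only the three smallest p-values are
# supplied, so these are upper bounds on the values that the full set
# of 2261 p-values would give.
bh <- p.adjust(pvals, method = "BH", n = n_tests)

expected_fp
bonf
bh
```

None of the three terms stays below 0.05 after either adjustment, which is consistent with the cautious reading in the question.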
>> thanks very much,
>> Janet Young
>> Dr. Janet Young (Trask lab)
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Avenue N., C3-168,
>> P.O. Box 19024, Seattle, WA 98109-1024, USA.
>> tel: (206) 667 1471 fax: (206) 667 6524
>> email: jayoung at fhcrc.org