[BioC] GOstats/Category hyperGTest gene universe problem

Sat May 15 00:53:24 CEST 2010

Hi James,

I know that you got an answer for this from me already, but I want to
post a response to the list.  That was a pretty good question!

In the end everything appears to be working as expected.  Basically, the
gene universe that is used to calculate things  is limited based on the
things being tested.  This is more conservative  than simply assuming
that everything you didn't test could also have  been in the pool. 
Remember that with a hypergeometric test, having a  larger pool of
things that "could have been picked" will artificially  inflate the
significance of your p-value, so you want to use every possible
opportunity to remove untested things from the calculation.  Categories
that are untested are therefore removed from the universe in the final
step.  This is why the KEGG and PFAM universes are smaller than your
estimate of 213.  In the case of PFAM and KEGG, these are sparsely
annotated, so the odds that something will NOT be labeled with an entire
category from  these is pretty good and the universe gets cut way down
(typically to ~130).  But for GO, things are more likely to be larger
(and less variable) because of 1) the fact that more things are labeled
with GO terms and 2) the DAG nature of GO annotations (which is
considered when using GOstats).  So it's not that GO universe can never
be made smaller in this way, it's just that it usually isn't.

  Marc

On 05/11/2010 06:45 AM, James F. Reid wrote:
> Dear list,
>
> when running two instances of a hyperGTest for KEGG pathways using
> different gene lists but identical gene universes I get two different
> gene universes in the results. This isn't the case for GO bp, mf, or
> cc but seems to be also for PFAM. Does anyone know why this is?
>
> Thanks.
> James.
>
> Example and sessionInfo() below.
>
> require("Category")
> require("GOstats")
> require("org.Hs.eg.db")
>
> set.seed(321)
>
> geneUniverse <- sample(mappedkeys(org.Hs.egSYMBOL), 2000)
> geneList1 <- sample(geneUniverse, 200)
> geneList2 <- sample(geneUniverse, 200)
>
> ## KEGG pathways
> parsKEGG1 <- new("KEGGHyperGParams",
>                 universeGeneIds = geneUniverse,
>                 geneIds = geneList1,
>                 annotation = "org.Hs.eg.db",
>                 testDirection = "over",
>                 pvalueCutoff = 0.01)
> parsKEGG2 <- parsKEGG1
> geneIds(parsKEGG2) <- geneList2
>
> testKEGG1 <- hyperGTest(parsKEGG1)
> testKEGG2 <- hyperGTest(parsKEGG2)
> universeMappedCount(testKEGG1) == universeMappedCount(testKEGG2)
> ## [1] FALSE
>
> universeMappedCount(testKEGG1)
> ## [1] 125
> universeMappedCount(testKEGG2)
> ## [1] 137
> sum(geneUniverse %in% mappedkeys(org.Hs.egPATH))
> ## [1] 213
>
> ## GO
> parsGO1 <- new("GOHyperGParams",
>                universeGeneIds = geneUniverse,
>                geneIds = geneList1,
>                annotation = "org.Hs.eg.db",
>                conditional = TRUE,
>                testDirection = "over",
>                pvalueCutoff = 0.01)
> parsGO2 <- parsGO1
> geneIds(parsGO2) <- geneList2
>
> ## GO BP
> ontology(parsGO1) <- ontology(parsGO2) <- "BP"
> testGObp1 <- hyperGTest(parsGO1)
> testGObp2 <- hyperGTest(parsGO2)
> universeMappedCount(testGObp1) == universeMappedCount(testGObp2)
> ## [1] TRUE
>
> ## the same in CC and MF ontologies
>
> sessionInfo()
>
> R version 2.11.0 (2010-04-22)
> x86_64-unknown-linux-gnu
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  tools     methods
> [8] base
>
> other attached packages:
> [1] org.Hs.eg.db_2.4.1   GOstats_2.14.0       RSQLite_0.9-0
> [4] DBI_0.2-5            graph_1.26.0         Category_2.14.0
> [7] AnnotationDbi_1.10.1 Biobase_2.8.0
>
> loaded via a namespace (and not attached):
> [1] annotate_1.26.0   genefilter_1.30.0 GO.db_2.4.1       GSEABase_1.10.0
> [5] RBGL_1.24.0       splines_2.11.0    survival_2.35-8   XML_3.1-0
> [9] xtable_1.5-6
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>