[BioC] GOstats: Listing genes from hyperGTest

Fri Oct 24 15:44:04 CEST 2008

Hi Tim,

Hmm. That's weird.

First off, I think you want to be using the GO2ALLEGS mapping rather 
than the GO2EG (or revmap(org.Hs.egGO) which is equivalent).

The reason for this is that the GO ontology is a directed acyclic graph 
where all children are subsets of their parents, so by definition all 
genes that are appended with a child GO term are also appended with the 
parent term (e.g., multicellular organismal development is a part of the 
developmental process, so any genes involved with the former are by 
default involved with the latter as well).

So modulo differences in our BioC versions, I get the same results as you:

 > summary(hyp, categorySize=2000)
       GOBPID       Pvalue OddsRatio  ExpCount Count Size
1 GO:0032501 0.0001909763  3.000224 11.303450    23 3420
2 GO:0032502 0.0008306266  2.730994  9.984714    20 3021
3 GO:0007275 0.0035077569  2.576107  7.224954    15 2186
4 GO:0007154 0.0042870538  2.271198 13.058459    22 3951
                                   Term
1     multicellular organismal process
2                developmental process
3 multicellular organismal development
4                   cell communication

And can get the genes using org.Hs.egGO2ALLEGS:

 > geneIds[geneIds %in% get("GO:0032501", org.Hs.egGO2ALLEGS)]
  [1] "1281"   "1410"   "157506" "1906"   "2625"   "3043"   "3553" 
"4015"
  [9] "4885"   "4886"   "5021"   "5156"   "5788"   "6005"   "6507" 
"673"
[17] "6876"   "695"    "7026"   "7070"   "7412"   "800"    "83890"

What I don't understand is how multicellular organismal process and 
developmental process can be missing from the org.Hs.egGO map

Both these processes are direct children of biological process, so 
really they should be there. Maybe Marc Carlson can shed some light on this.

Best,

Jim

Tim Smith wrote:
> Thanks Jim. I'm still having problems, i.e., I cannot find which subset of input genes resulted in the significant GO term. I have reproduced the problem that I am having:
> 
> -----------------------------------------------------
> library(org.Hs.eg.db)
> library(GOstats)
> library(GO.db)
> 
> # Set of genes (Entrez Ids) that I want to investigate
> geneIds <- c("10406",  "10418",  "11082",  "1281",   "1410",   "157506", "167410", "1906" ,  "2029",   "23091",  "2625" ,  "2823" ,  "2877",   "2993",   "3039" ,  "3043",  
>  "3046",   "3283",   "3553",   "4015",   "4069",   "4258",   "4345",   "4353",   "4885",   "4886",   "5021",   "5055",   "5151",   "5156",   "5320",   "5553",  
>  "55885",  "56667",  "5788" ,  "5999",   "6005",   "629",    "6507", "653145", "6590",   "673",    "6876",   "695",    "7026",   "7070",   "7103",   "7412",  
>  "760",    "7738",   "800",    "828",    "83890",  "945",    "963")  
> 
> ### I have reproduced Jim's code for the test
> univ <- Lkeys(org.Hs.egGO)
> param <- new("GOHyperGParams", geneIds = geneIds, universeGeneIds=univ, annotation="org.Hs.eg.db", ontology="BP")
> hyp <- hyperGTest(param)
> summary(hyp,categorySize=2000)
> ----------------------------------------------------
> 
> The result I get is :
> 
>       GOBPID       Pvalue OddsRatio  ExpCount Count Size                                 Term
> 1 GO:0032501 0.0001568145  3.021968 11.976486    24 3438     multicellular organismal process
> 2 GO:0032502 0.0012459810  2.626265 10.297409    20 2956                developmental process
> 3 GO:0007275 0.0051457709  2.457897  7.517527    15 2158 multicellular organismal development
> 4 GO:0007154 0.0061081457  2.187407 13.418681    22 3852                   cell communication
> 
>  I now want to know which subset of genes resulted in 'GO:0032501'. The error I get is:
> 
>> geneIds[geneIds %in% get("GO:0032501", revmap(org.Hs.egGO))]
> Error in .checkKeys(value, Rkeys(x), x at ifnotfound) : 
>   value for "GO:0032501" not found
> 
>> geneIds[geneIds %in% get("GO:0032502", revmap(org.Hs.egGO))]
> Error in .checkKeys(value, Rkeys(x), x at ifnotfound) : 
>   value for "GO:0032502" not found
> 
> [[elided Yahoo spam]]
> 
> thanks a lot.
> 
> Tim
> 
> 
> 
> 
> 
> Hi Tim, 
> 
> Yeah, probeSetSummary() is probably not what you want, if you are not 
> starting with an Affy chip. There are some gymnastics required to map 
> things back to the original Affy chip that you won't need to do. In 
> addition, if you are not using a conditional hypergeometric analysis, it 
> should be pretty simple to get what you want without even needing to 
> parse things out of the GOHyperGResult object. An example: 
> 
> ## fake up some data 
> 
>> geneIds <- Lkeys(org.Hs.egGO)[sample(1:5000, 500)] 
>> univ <- Lkeys(org.Hs.egGO) 
>> param <- new("GOHyperGParams", geneIds = geneIds, 
> universeGeneIds=univ, annotation="org.Hs.eg.db", ontology="BP") 
>> hyp <- hyperGTest(param) 
>> summary(hyp, categorySize=10) 
>       GOBPID      Pvalue OddsRatio   ExpCount Count Size 
>   Term 
> 1 GO:0007338 0.002723500  29.25101 0.07808304     2   54 single 
> fertilization 
> 2 GO:0009566 0.002925855  28.16374 0.08097501     2   56 
> fertilization 
> 
> So we have two terms of interest. Getting the Entrez Gene IDs from the 
> input set that map to these terms is easy: 
> 
>> geneIds[geneIds %in% get("GO:0007338", revmap(org.Hs.egGO))] 
> [1] "100131137" "10007" 
> 
> Now you might also want to know which 54 Entrez Gene IDs map to that 
> particular GO term. Since you are not conditioning, this includes that 
> particular GO term and all its offspring. 
> 
>> offspring <- get("GO:0007338", GOBPOFFSPRING) 
>> egids <- unique(unlist(mget(c("GO:0007338", offspring), 
> revmap(org.Hs.egGO), ifnotfound=NA), use.names=FALSE)) 
>> egids[!is.na(egids)] 
>  [1] "1047"      "4179"      "4240"      "4486"      "4809"      "5016" 
>  [7] "6674"      "7783"      "7784"      "7802"      "7993"      "8747" 
> [13] "8748"      "8852"      "9082"      "10007"     "10361"     "22917" 
> [19] "26476"     "53340"     "57055"     "57829"     "64100"     "93185" 
> [25] "158062"    "442868"    "100131137" "49"        "410"       "2683" 
> [31] "3010"      "4184"      "6677"      "7142"      "7455"      "8857" 
> [37] "11055"     "124626"    "2054"      "2741"      "10343"     "10566" 
> [43] "27297"     "152015"    "3074"      "167"       "928"       "2515" 
> [49] "5104"      "23553"     "284359"    "164684"    "7141"      "79400" 
> 
> Best, 
> 
> Jim 
> 
> 
> Tim Smith wrote: 
> 
> Thanks James. If I can tweak that function, I'll get exactly what I want. 
> I tried what you suggested and got the following error: 
> 
> --------------------------- 
>  ### 'genes1' are the Entrez IDs of my genes of interest, and 'allGenes' is the universe of Entrez IDs 
>   
>   paramsGO <- new("GOHyperGParams", geneIds = genes1, 
>            universeGeneIds = allGenes, annotation = "org.Hs.eg.db", 
>            ontology = "BP", pvalueCutoff = 1, conditional = FALSE, 
>            testDirection = "over") 
>   
>  GO <- hyperGTest(paramsGO) 
>  ps <- probeSetSummary(GO) 
> 
> Error in get(mapName, envir = pkgEnv, inherits = FALSE) : 
>   variable "org.Hs.egENTREZID" was not found 
> -------------------------------- 
> 
> I guess the function would return the probe ids if I was using them, but I have Entrez IDs as input. 
> 
> Or am I doing something wrong? 
> 
> thanks! 
> 
> 
> 
> 
> 
> ----- Original Message ---- 
> From: James W. MacDonald <jmacdon at med.umich.edu> 
> 
> Cc: bioc <bioconductor at stat.math.ethz.ch> 
> Sent: Wednesday, October 22, 2008 9:10:39 AM 
> Subject: Re: [BioC] GOstat: listing genes from hyperGTest 
> 
> Hi Tim, 
> 
> Does probeSetSummary() do what you want? 
> 
> Best, 
> 
> Jim 
> 
> 
> 
> Tim Smith wrote: 
> 
>   
> Hi, 
> 
> I 
> was performing a hyperGTest for genes in homo-sapiens. For a set of 
> input genes, this function returns some 'significant' GO terms. What I 
> wanted to now do was to co-relate each significant GO term (returned by 
> this function) with genes (from my set of input genes) associated with 
> that GO term. However, I think that I may be using the wrong 
> package/function to get the releveant set of genes. 
> 
> Currently, what I'm doing is finding the significant GO terms by using the following code: 
> 
> ----------------------- 
> ### 'genes1' are the Entrez IDs of my genes of interest, and 'allGenes' is the universe of Entrez IDs 
>  paramsGO <- new("GOHyperGParams", geneIds = genes1, 
>           universeGeneIds = allGenes, annotation = "org.Hs.eg.db", 
>           ontology = "BP", pvalueCutoff = 1, conditional = FALSE, 
>           testDirection = "over") 
> 
> GO <- hyperGTest(paramsGO) 
> -------------------------- 
> This 
> gives me a set of significant GO terms. Now, I would like to find which 
> subset of genes in 'genes1' is associated with each of the significant 
> GO term. To do this I map all GO terms to their Entrez IDs using the 
> 'org.Hs.eg.db' package using the following: 
> 
> xx <- as.list(org.Hs.egGO2EG) 
> 
> to 
> get a mapping of GO terms to Entrez IDs. I get 6,756 GO terms (isn't 
> this number small?) that map to at least one Entrez ID. So, from here I 
> look up which Entrez IDs are associated with my GO term of interest. 
> 
> My 
> problem is that often, the GO term from hyperGTest is not associated 
> with any Entrez ID (using xx <- as.list(org.Hs.egGO2EG) described 
> above ), i.e. the GO term/ID is not in the list obtained from 
> 'org.Hs.egGO2EG'). For example, the term 'GO:0043284' is thrown up by 
> hyperGTest, but does not appear to be associated with any Entrez IDs in 
> the org.Hs.eg.db package. Where could I be going wrong? 
> 
> [[elided Yahoo spam]] 
> 
> Thanks for any comments/suggestions. I realize that I'm probably doing something really stupid here.... 
> 
> My sessionInfo() is: 
> -------------------------------- 
> R version 2.7.2 (2008-08-25) 
> i386-pc-mingw32 
> locale: 
> LC_COLLATE=English_United 
> States.1252;LC_CTYPE=English_United 
> States.1252;LC_MONETARY=English_United 
> States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252 
> 
> attached base packages: 
>  [1] grid      splines   tools     stats     graphics  grDevices utils     datasets  methods   base    
> other attached packages: 
>  [1] 
> gplots_2.6.0         gmodels_2.14.1       gtools_2.4.0       
> gdata_2.4.1          Rgraphviz_1.18.1     GOstats_2.6.0     
> Category_2.6.0       [8] RBGL_1.16.0          annotate_1.18.0   
> xtable_1.5-2         graph_1.18.0         PFAM.db_2.2.0     
> GO.db_2.2.0          KEGG.db_2.2.0      [15] org.Hs.eg.db_2.2.0  
> AnnotationDbi_1.2.0  RSQLite_0.6-8        DBI_0.2-4           
> genefilter_1.20.0    survival_2.34-1      affy_1.18.0        [22]
> preprocessCore_1.2.0 affyio_1.8.0         Biobase_2.0.0      
> loaded via a namespace (and not attached): 
> [1] cluster_1.11.11 MASS_7.2-44    
> 
> --------------------------------- 
> 
> 
>      
>     [[alternative HTML version deleted]] 
> 
> _______________________________________________ 
> Bioconductor mailing list 
> Bioconductor at stat.math.ethz.ch 
> https://stat.ethz.ch/mailman/listinfo/bioconductor 
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor 
> 
> 

-- 
James W. MacDonald, M.S.
Biostatistician
Hildebrandt Lab
8220D MSRB III
1150 W. Medical Center Drive
Ann Arbor MI 48109-0646
734-936-8662