[BioC] Hypergeometric test with Disease Ontology

Fri Jan 28 14:34:41 CET 2011

I forgot to add my sessionInfo for the last post...
sessionInfo()
R version 2.12.1 (2010-12-16)
Platform: i386-pc-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United 
States.1252    LC_MONETARY=English_United States.1252 
LC_NUMERIC=C                           LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
  [1] org.Hs.eg.db_2.4.6   GeneAnswers_1.6.0    RColorBrewer_1.0-2   
Heatplus_1.20.0      rgl_0.92.798         MASS_7.3-9           
RSQLite_0.9-4        DBI_0.2-5            XML_3.2-0.2          
annotate_1.28.0      AnnotationDbi_1.12.0
[12] Biobase_2.10.0       RCurl_1.5-0.1        bitops_1.0-4.1       
igraph_0.5.5-1

loaded via a namespace (and not attached):
[1] graph_1.28.0     grid_2.12.1      RBGL_1.26.0      Rgraphviz_1.27.0 
tools_2.12.1     xtable_1.5-6
 >

On 1/28/2011 2:25 PM, Ted Morrow wrote:
> Hi Gilbert,
>
> Thanks for the response....it seems like GeneAnswers may do the job 
> but I am still having trouble getting it to work for me.
>
> I have two gene pools - one is a universe of genes (called say 
> "mygenepool") which is only a subset on all human genes, the other is 
> a subset of mygenepool (called say "mygenes"). I am interested in 
> knowing which DO terms show over-representation in mygenes given the 
> DO terms present in mygenepool. So, in your example I am a bit 
> confused as to which pool "mygenes" refers to. Is it "mygenepool" or 
> "mygenes"?
>
> So I have......
> str(mygenepool) # I used as.matrix then as.vector to change the 
> dataframe into a vector...is that OK?
> int [1:1310] 8813 3075 2729 8379 204 170302 10165 6521 799 3052 ...
>
> and
> str(mygenes)
> int [1:1202] 6833 10083 60528 8842 5048 4843 6329 5080 6401 6853 ...
>
>
> and mygenes is a subset of mygenepool:
> overlap <-  intersect(mygenepool, mygenes)
> str(overlap)
> int [1:1202] 6833 10083 60528 8842 5048 4843 6329 5080 6401 6853 ...
>
>
> Running the first bit of your example code seems to work fine:
> myDOLite <- lapply(DOLite, intersect, mygenepool)
> names(myDOLite) <- DOLiteTerm[names(myDOLite)]
>
> and so does creating a GeneAnswers instance (myGA) whether I specify 
> the totalGeneNumber or not:
> myGA <- geneAnswersBuilder(mygenes, myDOLite, categoryType='DOLITE', 
> testType='hyperG',
>             #known=FALSE, totalGeneNumber=1202,
>             pvalueT=0.1, geneExpressionProfile=NULL, verbose=TRUE)
>
> But then I get the following error:
> myGAreadable <- geneAnswersReadable(myGA, verbose=TRUE)
> [1] "Mapping geneInput ..."
> Error in switch(sub("org.*[:.:]", "", libname), eg = "EG", tair = 
> "TAIR",  :
>   EXPR must be a length 1 vector
>
> traceback()
> 2: getSymbols(as.vector(x at geneInput[, 1]), x at annLib, strict = strict,
>        missing = missing)
> 1: geneAnswersReadable(myGA, verbose = TRUE)
>
> If I instead create the GeneAnswers instance with 'org.Hs.eg.db' and 
> not myDOLite the step.....
> myGAreadable <- geneAnswersReadable(myGA, verbose=TRUE)
>
> .....works fine as do later steps (for example geneAnsersChartPlots 
> and topDOLiteGenes)
>
> But then presumably I am not working with the restricted pool of genes 
> that I want to (i.e. mygenepool)
>
> Any ideas?
> Cheers
> Ted
>
>
>
>
>
>
> On 1/27/2011 5:19 PM, Gilbert Feng wrote:
>> Hi, Ted
>>
>> If I correctly understand your question, you have your own gene pool, 
>> don't you? In that case, GeneAnswers still can handle that. What you 
>> need is to build a customized DOLite list to run analysis. For example,
>>
>> if your genes are mygenes(an unique Entrez GENE ID vector)
>>
>> myDOLite <- lapply(DOLite, intersect, mygenes)
>> names(myDOLite) <- DOLiteTerm[names(myDOLite)]
>> #Please note that the parameter 'known' is TRUE in 
>> geneAnswersBuilder, which means the test will only use shared genes 
>> in mygenes and DOLite genes as the gene pool to run hypergeometric 
>> test. You can change the number of gene pool by set parameter 
>> 'totalGeneNumber' and set 'known' to FALSE at the same time .
>> myGA <- geneAnswersBuilder(mygenes, myDOLite, testType='hyperG', ...)
>>
>> since DO is designed for human, you can use setAnnLib to set the 
>> species as 'org.Hs.eg.db' for further visualization or analysis.
>>
>> Let me know if you have any question.
>>
>> Gilbert
>>
>> BTW, the current one has a new function to automatically generate a 
>> report for a or several groups of genes. Please check "groupReport" 
>> page, but before you run it, make sure you have java support and 
>> flash installation.
>>
>>
>> On 1/27/11 9:45 AM, Ted Morrow wrote:
>>> Thanks for your replies Gilbert and Peter.....I forgot to say in my
>>> original question that I did look into using both the GeneAnswers
>>> package and the FunDO web-based system but the issue I have not been
>>> able to solve with them is that I cannot specify "my gene Universe"
>>> within which my selected genes are tested for enrichment. I want to see
>>> which diseases are enriched out of the possible diseases in "my gene
>>> universe" which is a subset of all possible human genes.
>>>
>>> Is that possible in GeneAnswers (or even FunDO)?
>>>
>>> Cheers
>>> Ted
>>>
>>> On 1/27/2011 4:39 PM, Peter Robinson wrote:
>>>> On 01/27/2011 04:17 PM, Gilbert Feng wrote:
>>>>> Hi, Ted
>>>>>
>>>>> DO.db is a standard sqlite file and you can use standard RSQLite
>>>>> procedure to retrieve the information. Actually we do have a lite
>>>>> version of Disease Ontology, DOLite that removes some redundant 
>>>>> nodes,
>>>>> integrated in GeneAnswers package, which also uses hypergeometric 
>>>>> test
>>>>> to run enrichment analysis as well as automatically generated
>>>>> interactive(cytoscape web support) html summary for one or more 
>>>>> groups
>>>>> of genes.
>>>>>
>>>>> Best
>>>>>
>>>>> Gilbert
>>>>
>>>>
>>>>
>>>>
>>>> You might also want to take a look at this website:
>>>> http://django.nubic.northwestern.edu/fundo/faq
>>>> which implements enrichment analysis using genes annotated to DO 
>>>> terms.
>>>> -peter
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>> On 1/27/11 5:00 AM, Ted Morrow wrote:
>>>>>> Dear all,
>>>>>>
>>>>>> I would like to conduct a hypergeometric test on a list of genes but
>>>>>> using Disease Ontology instead of GO or KEGG terms. The package 
>>>>>> "DO.db"
>>>>>> contains this information but I have been unable to find a way of 
>>>>>> using
>>>>>> this database in conjunction with the "GOstats" package that I have
>>>>>> been
>>>>>> using.
>>>>>>
>>>>>> Has anyone attempted to run a hypergeometric test for Disease 
>>>>>> Ontology
>>>>>> terms? Is there another package I could use? Or is there a way of
>>>>>> modifying the argument (if that's the right word) GOHyperGParams in
>>>>>> GOstats so that it can make use of the information in DO.db?
>>>>>>
>>>>>> Both the GO and KEGG analyses work fine:
>>>>>> ### GO chunk
>>>>>> params<- new("GOHyperGParams",
>>>>>> geneIds = selectedEntrezIds, universeGeneIds = entrezUniverse,
>>>>>> annotation = "hgu95av2.db", ontology="BP",
>>>>>> pvalueCutoff=0.01, conditional=FALSE,
>>>>>> testDirection="over")
>>>>>> hgOver<- hyperGTest(params)
>>>>>>
>>>>>> hgOver
>>>>>> Gene to GO BP test for over-representation
>>>>>> 2136 GO BP ids tested (15 have p< 0.01)
>>>>>> Selected gene set size: 112
>>>>>> Gene universe size: 951
>>>>>> Annotation package: hgu95av2
>>>>>>
>>>>>> ### KEGG chunk
>>>>>> paramsKEGG<- new("KEGGHyperGParams",
>>>>>> geneIds = selectedEntrezIds, universeGeneIds = entrezUniverse,
>>>>>> annotation = "hgu95av2.db",
>>>>>> pvalueCutoff=0.01,
>>>>>> testDirection="over")
>>>>>>
>>>>>> hgOverKEGG<- hyperGTest(paramsKEGG)
>>>>>> hgOverKEGG
>>>>>> summary(hgOverKEGG)
>>>>>>
>>>>>> My data looks like this:
>>>>>> str(selectedEntrezIds)
>>>>>> chr [1:157] "60528" "6853" "10157" "5081" "389434" "6591" "7414" 
>>>>>> "7546"
>>>>>> "3074" "6916" "6559" "23503" "8626" "6557" "38" "60" "9733" "113235"
>>>>>> "28962" "10269" "4069" "30835" "7018" ...
>>>>>>
>>>>>> > str(entrezUniverse)
>>>>>> chr [1:1310] "8813" "3075" "2729" "8379" "204" "170302" "10165" 
>>>>>> "6521"
>>>>>> "799" "3052" "1387" "5244" "3674" "6833" "10083" "60528" "8842" 
>>>>>> "5048"
>>>>>> "4843" "6329" "5080" "6401" "6853" ...
>>>>>>
>>>>>> My naive attempts to use DO have included:
>>>>>> paramsDO<- new("DOHyperGParams",
>>>>>> geneIds = selectedEntrezIds, universeGeneIds = entrezUniverse,
>>>>>> annotation = "DO.db",
>>>>>> pvalueCutoff=0.01,
>>>>>> testDirection="over")
>>>>>>
>>>>>> Which of course doesn't work and gives the following error:
>>>>>> Error in getClass(Class, where = topenv(parent.frame())) :
>>>>>> "DOHyperGParams" is not a defined class
>>>>>>
>>>>>> > traceback()
>>>>>> 3: stop(gettextf("\"%s\" is not a defined class", Class), domain 
>>>>>> = NA)
>>>>>> 2: getClass(Class, where = topenv(parent.frame()))
>>>>>> 1: new("DOHyperGParams", geneIds = selectedEntrezIds, 
>>>>>> universeGeneIds =
>>>>>> entrezUniverse,
>>>>>> annotation = "DO.db", pvalueCutoff = 0.01, testDirection = "over")
>>>>>>
>>>>>>
>>>>>> Replacing "GOHyperGParams" with "DOHyperGParams" also gives the
>>>>>> following error:..
>>>>>>
>>>>>> hgOverDO<- hyperGTest(paramsDO)
>>>>>> Error in match.arg(ontology, c("BP", "CC", "MF")) :
>>>>>> 'arg' should be one of “BP”, “CC”, “MF”
>>>>>>
>>>>>> traceback()
>>>>>> 10: stop(gettextf("'arg' should be one of %s", 
>>>>>> paste(dQuote(choices),
>>>>>> collapse = ", ")), domain = NA)
>>>>>> 9: match.arg(ontology, c("BP", "CC", "MF"))
>>>>>> 8: getUniverseViaGo(p)
>>>>>> 7: universeBuilder(p)
>>>>>> 6: universeBuilder(p)
>>>>>> 5: .hyperGTestInternal(p)
>>>>>> 4: is(object, Cl)
>>>>>> 3: is(object, Cl)
>>>>>> 2: .valueClassTest(standardGeneric("hyperGTest"), 
>>>>>> "HyperGResultBase",
>>>>>> "hyperGTest")
>>>>>> 1: hyperGTest(paramsDO)
>>>>>>
>>>>>>
>>>>>> Any help would be greatly appreciated.
>>>>>> /Ted
>>>>>>
>>>>>> > sessionInfo()
>>>>>> R version 2.12.1 (2010-12-16)
>>>>>> Platform: i386-pc-mingw32/i386 (32-bit)
>>>>>>
>>>>>> locale:
>>>>>> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United
>>>>>> States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
>>>>>> LC_TIME=English_United States.1252
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] stats graphics grDevices utils datasets methods base
>>>>>>
>>>>>> other attached packages:
>>>>>> [1] KEGG.db_2.4.5 DO.db_2.1.0 GO.db_2.4.5 hgu95av2.db_2.4.5
>>>>>> org.Hs.eg.db_2.4.6 GOstats_2.16.0 RSQLite_0.9-4 DBI_0.2-5 
>>>>>> graph_1.28.0
>>>>>> Category_2.16.0 AnnotationDbi_1.12.0
>>>>>> [12] Biobase_2.10.0
>>>>>>
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] annotate_1.28.0 genefilter_1.32.0 GSEABase_1.12.2 RBGL_1.26.0
>>>>>> splines_2.12.1 survival_2.36-2 tools_2.12.1 XML_3.2-0.2 xtable_1.5-6
>>>>>>
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>> .
>>>>>
>>>>
>>>>
>>>
>>
>

-- 
__________________________________________________________________________________

Edward H. Morrow
Department of Ecology and Evolution
Animal Ecology (zooekologi)
Evolutionary Biology Centre
Uppsala University
Norbyvägen 18-D
SE-752 36 Uppsala
SWEDEN

Email: ted.morrow at ebc.uu.se
Tel: +46 18 471 2676
Fax +46 18 471 6484
Webpage: http://www.iee.uu.se/zooekol/default.php?type=personalpage&id=119&lang=en