[BioC] Hypergeometric test with Disease Ontology

Gilbert Feng g-feng at northwestern.edu
Fri Jan 28 22:50:58 CET 2011


Hi, Vincent

That's a good question. Currently, DO.obo file contains some synonyms 
for each DO term/ID. Our collaborators are working on curation right 
now. I'll talk to them about this. So far, what I can think is to 
retrieve the synonyms from DO.obo and use some text mining tools to find 
the DO ids.

More info can be found at 
http://do-wiki.nubic.northwestern.edu/index.php/Main_Page  And there 
will be a DO workshop between 2/8 and 2/9 at Baltimore.

Best

Gilbert



On 1/28/11 2:25 PM, Vincent Carey wrote:
> This has been an interesting discussion.  A marginally connected
> question: Are there tools to normalize found lists of symptoms that
> use informal terminology so that they can be matched to formally
> catalogued terms in DO?
>
> On Fri, Jan 28, 2011 at 9:57 AM, Ted Morrow<ted.morrow at ebc.uu.se>  wrote:
>> Hi Gilbert,
>>
>> Great thanks for the clarification....OK building myGA works fine and I have
>> setAnnLib to org.Hs.eg.db with:
>> myGA2<- setAnnLib(myGA, 'org.Hs.eg.db')
>> getAnnLib(myGA2)
>> [1] "org.Hs.eg.db"
>>
>> .....but at the geneAnswerReadable step:
>> myGAreadable<- geneAnswersReadable(myGA2, catTerm=FALSE)
>>
>> I get the following error:
>> [1] "Mapping geneInput ..."
>> [1] "Warning: some genes do not have valid symbols!"
>> [1] "Mapping genesInCategory ..."
>>
>>   traceback()
>> 2: stop(paste("The input geneAnswers categoryType is not DOLite but ",
>>        x at categoryType, ". stop function!"))
>> 1: topDOLiteGenes(myGA, orderby = "pvalue", topGenes = "ALL", genesOrderBy =
>> "pValue",
>>        file = TRUE)
>>
>> Cheers
>> Ted
>>
>> sessionInfo()
>> R version 2.12.1 (2010-12-16)
>> Platform: i386-pc-mingw32/i386 (32-bit)
>>
>> locale:
>> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United
>> States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C
>>                  LC_TIME=English_United States.1252
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>>   [1] org.Hs.eg.db_2.4.6   GeneAnswers_1.6.0    RColorBrewer_1.0-2
>> Heatplus_1.20.0      rgl_0.92.798         MASS_7.3-9           RSQLite_0.9-4
>>         DBI_0.2-5            XML_3.2-0.2          annotate_1.28.0
>>   AnnotationDbi_1.12.0
>> [12] Biobase_2.10.0       RCurl_1.5-0.1        bitops_1.0-4.1
>> igraph_0.5.5-1
>>
>> loaded via a namespace (and not attached):
>> [1] graph_1.28.0     grid_2.12.1      RBGL_1.26.0      Rgraphviz_1.27.0
>> tools_2.12.1     xtable_1.5-6
>>>
>>
>>
>>
>> On 1/28/2011 3:40 PM, Gilbert Feng wrote:
>>>
>>> Hi, Ted
>>>
>>> Yes, your codes about mygenepool and mygenes are correct. I didn't clearly
>>> state them in my previous email. Sorry about that. myDOLite should be built
>>> by DOLite and mygenepool, not mygenes. So currently, myDOLite is a
>>> customized annotation library and when you build a geneAnswers instance,
>>> leave categoryType to NULL(default value), and in your case, I think the
>>> total gene number should be 1310, not 1202, since the former is the gene
>>> number of mygenepool.
>>>
>>> after buiding myGA, run setAnnLib to set it to org.Hs.eg.db I guess the
>>> error is that you don't set it.
>>>
>>> For the further mapping function, like geneAnswerReadable or other
>>> visualization, always set catTerm=FALSE.
>>>
>>> Let me know if it doesn't work.
>>>
>>> Best
>>>
>>> Gilbert
>>>
>>>
>>>> I have two gene pools - one is a universe of genes (called say
>>>> "mygenepool") which is only a subset on all human genes, the other is a
>>>> subset of mygenepool (called say "mygenes"). I am interested in knowing
>>>> which DO terms show over-representation in mygenes given the DO terms
>>>> present in mygenepool. So, in your example I am a bit confused as to
>>>> which pool "mygenes" refers to. Is it "mygenepool" or "mygenes"?
>>>>
>>>> So I have......
>>>> str(mygenepool) # I used as.matrix then as.vector to change the
>>>> dataframe into a vector...is that OK?
>>>> int [1:1310] 8813 3075 2729 8379 204 170302 10165 6521 799 3052 ...
>>>>
>>>> and
>>>> str(mygenes)
>>>> int [1:1202] 6833 10083 60528 8842 5048 4843 6329 5080 6401 6853 ...
>>>>
>>>>
>>>> and mygenes is a subset of mygenepool:
>>>> overlap<- intersect(mygenepool, mygenes)
>>>> str(overlap)
>>>> int [1:1202] 6833 10083 60528 8842 5048 4843 6329 5080 6401 6853 ...
>>>>
>>>>
>>>> Running the first bit of your example code seems to work fine:
>>>> myDOLite<- lapply(DOLite, intersect, mygenepool)
>>>> names(myDOLite)<- DOLiteTerm[names(myDOLite)]
>>>>
>>>> and so does creating a GeneAnswers instance (myGA) whether I specify the
>>>> totalGeneNumber or not:
>>>> myGA<- geneAnswersBuilder(mygenes, myDOLite, categoryType='DOLITE',
>>>> testType='hyperG',
>>>> #known=FALSE, totalGeneNumber=1202,
>>>> pvalueT=0.1, geneExpressionProfile=NULL, verbose=TRUE)
>>>>
>>>> But then I get the following error:
>>>> myGAreadable<- geneAnswersReadable(myGA, verbose=TRUE)
>>>> [1] "Mapping geneInput ..."
>>>> Error in switch(sub("org.*[:.:]", "", libname), eg = "EG", tair = "TAIR",
>>>> :
>>>> EXPR must be a length 1 vector
>>>>
>>>> traceback()
>>>> 2: getSymbols(as.vector(x at geneInput[, 1]), x at annLib, strict = strict,
>>>> missing = missing)
>>>> 1: geneAnswersReadable(myGA, verbose = TRUE)
>>>>
>>>> If I instead create the GeneAnswers instance with 'org.Hs.eg.db' and not
>>>> myDOLite the step.....
>>>> myGAreadable<- geneAnswersReadable(myGA, verbose=TRUE)
>>>>
>>>> .....works fine as do later steps (for example geneAnsersChartPlots and
>>>> topDOLiteGenes)
>>>>
>>>> But then presumably I am not working with the restricted pool of genes
>>>> that I want to (i.e. mygenepool)
>>>>
>>>> Any ideas?
>>>> Cheers
>>>> Ted
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 1/27/2011 5:19 PM, Gilbert Feng wrote:
>>>>>
>>>>> Hi, Ted
>>>>>
>>>>> If I correctly understand your question, you have your own gene pool,
>>>>> don't you? In that case, GeneAnswers still can handle that. What you
>>>>> need is to build a customized DOLite list to run analysis. For example,
>>>>>
>>>>> if your genes are mygenes(an unique Entrez GENE ID vector)
>>>>>
>>>>> myDOLite<- lapply(DOLite, intersect, mygenes)
>>>>> names(myDOLite)<- DOLiteTerm[names(myDOLite)]
>>>>> #Please note that the parameter 'known' is TRUE in geneAnswersBuilder,
>>>>> which means the test will only use shared genes in mygenes and DOLite
>>>>> genes as the gene pool to run hypergeometric test. You can change the
>>>>> number of gene pool by set parameter 'totalGeneNumber' and set 'known'
>>>>> to FALSE at the same time .
>>>>> myGA<- geneAnswersBuilder(mygenes, myDOLite, testType='hyperG', ...)
>>>>>
>>>>> since DO is designed for human, you can use setAnnLib to set the
>>>>> species as 'org.Hs.eg.db' for further visualization or analysis.
>>>>>
>>>>> Let me know if you have any question.
>>>>>
>>>>> Gilbert
>>>>>
>>>>> BTW, the current one has a new function to automatically generate a
>>>>> report for a or several groups of genes. Please check "groupReport"
>>>>> page, but before you run it, make sure you have java support and flash
>>>>> installation.
>>>>>
>>>>>
>>>>> On 1/27/11 9:45 AM, Ted Morrow wrote:
>>>>>>
>>>>>> Thanks for your replies Gilbert and Peter.....I forgot to say in my
>>>>>> original question that I did look into using both the GeneAnswers
>>>>>> package and the FunDO web-based system but the issue I have not been
>>>>>> able to solve with them is that I cannot specify "my gene Universe"
>>>>>> within which my selected genes are tested for enrichment. I want to see
>>>>>> which diseases are enriched out of the possible diseases in "my gene
>>>>>> universe" which is a subset of all possible human genes.
>>>>>>
>>>>>> Is that possible in GeneAnswers (or even FunDO)?
>>>>>>
>>>>>> Cheers
>>>>>> Ted
>>>>>>
>>>>>> On 1/27/2011 4:39 PM, Peter Robinson wrote:
>>>>>>>
>>>>>>> On 01/27/2011 04:17 PM, Gilbert Feng wrote:
>>>>>>>>
>>>>>>>> Hi, Ted
>>>>>>>>
>>>>>>>> DO.db is a standard sqlite file and you can use standard RSQLite
>>>>>>>> procedure to retrieve the information. Actually we do have a lite
>>>>>>>> version of Disease Ontology, DOLite that removes some redundant
>>>>>>>> nodes,
>>>>>>>> integrated in GeneAnswers package, which also uses hypergeometric
>>>>>>>> test
>>>>>>>> to run enrichment analysis as well as automatically generated
>>>>>>>> interactive(cytoscape web support) html summary for one or more
>>>>>>>> groups
>>>>>>>> of genes.
>>>>>>>>
>>>>>>>> Best
>>>>>>>>
>>>>>>>> Gilbert
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> You might also want to take a look at this website:
>>>>>>> http://django.nubic.northwestern.edu/fundo/faq
>>>>>>> which implements enrichment analysis using genes annotated to DO
>>>>>>> terms.
>>>>>>> -peter
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> On 1/27/11 5:00 AM, Ted Morrow wrote:
>>>>>>>>>
>>>>>>>>> Dear all,
>>>>>>>>>
>>>>>>>>> I would like to conduct a hypergeometric test on a list of genes but
>>>>>>>>> using Disease Ontology instead of GO or KEGG terms. The package
>>>>>>>>> "DO.db"
>>>>>>>>> contains this information but I have been unable to find a way of
>>>>>>>>> using
>>>>>>>>> this database in conjunction with the "GOstats" package that I have
>>>>>>>>> been
>>>>>>>>> using.
>>>>>>>>>
>>>>>>>>> Has anyone attempted to run a hypergeometric test for Disease
>>>>>>>>> Ontology
>>>>>>>>> terms? Is there another package I could use? Or is there a way of
>>>>>>>>> modifying the argument (if that's the right word) GOHyperGParams in
>>>>>>>>> GOstats so that it can make use of the information in DO.db?
>>>>>>>>>
>>>>>>>>> Both the GO and KEGG analyses work fine:
>>>>>>>>> ### GO chunk
>>>>>>>>> params<- new("GOHyperGParams",
>>>>>>>>> geneIds = selectedEntrezIds, universeGeneIds = entrezUniverse,
>>>>>>>>> annotation = "hgu95av2.db", ontology="BP",
>>>>>>>>> pvalueCutoff=0.01, conditional=FALSE,
>>>>>>>>> testDirection="over")
>>>>>>>>> hgOver<- hyperGTest(params)
>>>>>>>>>
>>>>>>>>> hgOver
>>>>>>>>> Gene to GO BP test for over-representation
>>>>>>>>> 2136 GO BP ids tested (15 have p<  0.01)
>>>>>>>>> Selected gene set size: 112
>>>>>>>>> Gene universe size: 951
>>>>>>>>> Annotation package: hgu95av2
>>>>>>>>>
>>>>>>>>> ### KEGG chunk
>>>>>>>>> paramsKEGG<- new("KEGGHyperGParams",
>>>>>>>>> geneIds = selectedEntrezIds, universeGeneIds = entrezUniverse,
>>>>>>>>> annotation = "hgu95av2.db",
>>>>>>>>> pvalueCutoff=0.01,
>>>>>>>>> testDirection="over")
>>>>>>>>>
>>>>>>>>> hgOverKEGG<- hyperGTest(paramsKEGG)
>>>>>>>>> hgOverKEGG
>>>>>>>>> summary(hgOverKEGG)
>>>>>>>>>
>>>>>>>>> My data looks like this:
>>>>>>>>> str(selectedEntrezIds)
>>>>>>>>> chr [1:157] "60528" "6853" "10157" "5081" "389434" "6591" "7414"
>>>>>>>>> "7546"
>>>>>>>>> "3074" "6916" "6559" "23503" "8626" "6557" "38" "60" "9733" "113235"
>>>>>>>>> "28962" "10269" "4069" "30835" "7018" ...
>>>>>>>>>
>>>>>>>>>> str(entrezUniverse)
>>>>>>>>> chr [1:1310] "8813" "3075" "2729" "8379" "204" "170302" "10165"
>>>>>>>>> "6521"
>>>>>>>>> "799" "3052" "1387" "5244" "3674" "6833" "10083" "60528" "8842"
>>>>>>>>> "5048"
>>>>>>>>> "4843" "6329" "5080" "6401" "6853" ...
>>>>>>>>>
>>>>>>>>> My naive attempts to use DO have included:
>>>>>>>>> paramsDO<- new("DOHyperGParams",
>>>>>>>>> geneIds = selectedEntrezIds, universeGeneIds = entrezUniverse,
>>>>>>>>> annotation = "DO.db",
>>>>>>>>> pvalueCutoff=0.01,
>>>>>>>>> testDirection="over")
>>>>>>>>>
>>>>>>>>> Which of course doesn't work and gives the following error:
>>>>>>>>> Error in getClass(Class, where = topenv(parent.frame())) :
>>>>>>>>> "DOHyperGParams" is not a defined class
>>>>>>>>>
>>>>>>>>>> traceback()
>>>>>>>>> 3: stop(gettextf("\"%s\" is not a defined class", Class), domain =
>>>>>>>>> NA)
>>>>>>>>> 2: getClass(Class, where = topenv(parent.frame()))
>>>>>>>>> 1: new("DOHyperGParams", geneIds = selectedEntrezIds,
>>>>>>>>> universeGeneIds =
>>>>>>>>> entrezUniverse,
>>>>>>>>> annotation = "DO.db", pvalueCutoff = 0.01, testDirection = "over")
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Replacing "GOHyperGParams" with "DOHyperGParams" also gives the
>>>>>>>>> following error:..
>>>>>>>>>
>>>>>>>>> hgOverDO<- hyperGTest(paramsDO)
>>>>>>>>> Error in match.arg(ontology, c("BP", "CC", "MF")) :
>>>>>>>>> 'arg' should be one of “BP”, “CC”, “MF”
>>>>>>>>>
>>>>>>>>> traceback()
>>>>>>>>> 10: stop(gettextf("'arg' should be one of %s",
>>>>>>>>> paste(dQuote(choices),
>>>>>>>>> collapse = ", ")), domain = NA)
>>>>>>>>> 9: match.arg(ontology, c("BP", "CC", "MF"))
>>>>>>>>> 8: getUniverseViaGo(p)
>>>>>>>>> 7: universeBuilder(p)
>>>>>>>>> 6: universeBuilder(p)
>>>>>>>>> 5: .hyperGTestInternal(p)
>>>>>>>>> 4: is(object, Cl)
>>>>>>>>> 3: is(object, Cl)
>>>>>>>>> 2: .valueClassTest(standardGeneric("hyperGTest"),
>>>>>>>>> "HyperGResultBase",
>>>>>>>>> "hyperGTest")
>>>>>>>>> 1: hyperGTest(paramsDO)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Any help would be greatly appreciated.
>>>>>>>>> /Ted
>>>>>>>>>
>>>>>>>>>> sessionInfo()
>>>>>>>>> R version 2.12.1 (2010-12-16)
>>>>>>>>> Platform: i386-pc-mingw32/i386 (32-bit)
>>>>>>>>>
>>>>>>>>> locale:
>>>>>>>>> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United
>>>>>>>>> States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
>>>>>>>>> LC_TIME=English_United States.1252
>>>>>>>>>
>>>>>>>>> attached base packages:
>>>>>>>>> [1] stats graphics grDevices utils datasets methods base
>>>>>>>>>
>>>>>>>>> other attached packages:
>>>>>>>>> [1] KEGG.db_2.4.5 DO.db_2.1.0 GO.db_2.4.5 hgu95av2.db_2.4.5
>>>>>>>>> org.Hs.eg.db_2.4.6 GOstats_2.16.0 RSQLite_0.9-4 DBI_0.2-5
>>>>>>>>> graph_1.28.0
>>>>>>>>> Category_2.16.0 AnnotationDbi_1.12.0
>>>>>>>>> [12] Biobase_2.10.0
>>>>>>>>>
>>>>>>>>> loaded via a namespace (and not attached):
>>>>>>>>> [1] annotate_1.28.0 genefilter_1.32.0 GSEABase_1.12.2 RBGL_1.26.0
>>>>>>>>> splines_2.12.1 survival_2.36-2 tools_2.12.1 XML_3.2-0.2 xtable_1.5-6
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Bioconductor mailing list
>>>>>>>> Bioconductor at r-project.org
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>> Search the archives:
>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>> .
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>> --
>> __________________________________________________________________________________
>>
>> Edward H. Morrow
>> Department of Ecology and Evolution
>> Animal Ecology (zooekologi)
>> Evolutionary Biology Centre
>> Uppsala University
>> Norbyvägen 18-D
>> SE-752 36 Uppsala
>> SWEDEN
>>
>> Email: ted.morrow at ebc.uu.se
>> Tel: +46 18 471 2676
>> Fax +46 18 471 6484
>> Webpage:
>> http://www.iee.uu.se/zooekol/default.php?type=personalpage&id=119&lang=en
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>

-- 
-----------------------------------------------
Gang (Gilbert) Feng, PhD
Biomedical Informatics Center
Robert H. Lurie Comprehensive Cancer Center
Northwestern University
750 N. Lake Shore Drive, 11th Floor(11-175e)
Chicago, IL  60611
Phone:312-503-2358
Email g-feng (at) northwestern.edu



More information about the Bioconductor mailing list