[BioC] GOstats::geneIdUniverse() and its relation to organism annotation db's
Marc Carlson
mcarlson at fhcrc.org
Wed Sep 8 19:36:25 CEST 2010
Hi Steve,
I can verify that GOstats uses "GO2ALL" mappings to do its
calculations. Speeding this sort of comparison up is actually a lot of
why these mappings are actually around.
Marc
On 09/07/2010 11:15 AM, Steve Lianoglou wrote:
> Hi Robert,
>
> In my hunt to find "the bug" I didn't stumble upon the difference
> between org.Sc.sgdGO and org.Sc.sgdGO2ALLORFS (I don't think I
> realized the latter one even existed).
>
> I'm guessing the problem I thought I saw was due to my oversight and
> is exactly as you say. Thanks for the thorough explanation and
> references.
>
> Cheers,
> -steve
>
> On Tue, Sep 7, 2010 at 9:16 AM, Robert M. Flight <rflight79 at gmail.com> wrote:
>
>> Hi Steve,
>>
>> I wonder if you are running into the problem that generally when you
>> query for the GO terms you get only those GO terms that the ORF is
>> directly annotated with (this would correspond to the actual
>> annotations on the GO website, for instance), however when calculating
>> enrichment, not only the directly annotated terms but also the parent
>> terms (less specific) are included.
>>
>> I'm thinking of the difference between "org.Sc.sgdGO", which returns
>> "org.Sc.sgdGO is an R object that provides mappings between ORF
>> identifiers and the GO identifiers that they are directly associated
>> with. This mapping and its reverse mapping do NOT associate the child
>> terms from the GO ontology with the gene. Only the directly evidenced
>> terms are represented here.", and a mapping that returns also the
>> other GO terms in the directed acyclic graph (up the tree
>> essentially). This would be the reverse of "org.Sc.sgdGO2ALLORFS".
>>
>> For example, any gene directly annotated to "GO:0007155 : cell
>> adhesion", will also by virtue of the DAG be indirectly annotated with
>> "GO:0022610 : biological adhesion", whether or not it is directly
>> annotated with that or not. When GO enrichment using GOHyperGTest is
>> calculated, all of the indirectly annotated terms are considered as
>> well. (see this link:
>> http://amigo.geneontology.org/cgi-bin/amigo/browse.cgi?action=plus_node&target=GO:0022610&open_1=GO:0008150,all,amigo1283864733)
>>
>> This seems to be the cause of your disparity below. You might want to
>> read a bit more about the GO and how enrichment calculations are done.
>> A good paper describing the nuts and bolts of most packages is Gavin
>> Sherlocks paper on GO::TermFinder
>> (http://www.ncbi.nlm.nih.gov/pubmed/15297299), which is the basis for
>> most other enrichment tools, including GOStats.
>>
>> If I do:
>>
>> crud <- mget("GO:0007059", envir=org.Sc.sgdGO2ALLORFS)
>> crud2 <- unlist(crud)
>> grep("YKL049C", crud2)
>>
>> I get 135 as the result.
>>
>> Cheers,
>>
>> -Robert
>>
>> Robert M. Flight, Ph.D.
>> Bioinformatics and Biomedical Computing Laboratory
>> University of Louisville
>> Louisville, KY
>>
>> PH 502-852-0467
>> EM robert.flight at louisville.edu
>> EM rflight79 at gmail.com
>>
>> Williams and Holland's Law:
>> If enough data is collected, anything may be proven by
>> statistical methods.
>>
>>
>>
>> On Mon, Sep 6, 2010 at 05:21, Steve Lianoglou
>> <mailinglist.honeypot at gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I've been trying to do some non-run-of-the-mill gene ontology analysis
>>> and have been messing around with the innards of the results we get
>>> from GOstats::hyperGTest.
>>>
>>> As a result, I've found some peculiarities in the GO annotations I'm
>>> retrieving from the hyperGTest and how they relate to the GO
>>> annotations in my organism's (saccharomyces cerevisiae) annotation
>>> database.
>>>
>>> In particular: for some enriched GO term X, I'm finding genes listed
>>> in the geneIdUniverse of X that do not have X as a member of their
>>> organism.db.GO map, and I'm not sure how that could be. I'm even
>>> setting condition=FALSE to my hyperGTest to make it as "vanilla" as
>>> possible (but as far as I understand the conditional test, this
>>> wouldn't effect what I'm seeing anyway since genes are removed from GO
>>> terms, and not added).
>>>
>>> I have a specific example below. You'll find that for the first
>>> significant GOID found (GO:0007059), gene "YKL049C" is listed as a
>>> member of this GOID's geneIdUniverse as it is returned from the
>>> hyperGTest, but GO:0007059 isn't listed in my annotation db GO mapping
>>> (org.Sc.sgdGO) for this gene, nor is "YKL049C" found in its GO2ORF
>>> map.
>>>
>>> === Example ===
>>>
>>> library(GOstats)
>>> library(org.Sc.sgd.db)
>>> library(annotate)
>>>
>>> orfs <- c(
>>> "YGR084C", "YDR409W", "YFR017C", "YDR342C", "YBR152W",
>>> "YIL014W", "YGR134W", "YDL109C", "YDL234C", "YDR524C",
>>> "YBR002C", "YGR063C", "YDR034C-A", "YIR011C",
>>> "YHR175W", "YHR109W", "YGR207C", "YCR023C", "YER039C-A",
>>> "YGL064C", "YGR249W", "YHR172W", "YGL226W", "YAL064W",
>>> "YDR309C", "YDR473C", "YAL058W", "YHL030W", "YKL101W",
>>> "YJL122W", "YBR121C", "YIL018W", "YDR443C", "YBR069C",
>>> "YBR011C", "YHR009C", "YGL166W", "YIL103W", "YDR206W",
>>> "YBL031W", "YBR162W-A", "YFL018C", "YBR214W", "YIL076W",
>>> "YBR291C", "YKL049C", "YIL149C", "YDR034W-B", "YDR395W")
>>>
>>> orf2go <- getAnnMap('GO', 'org.Sc.sgd')
>>> go2orf <- getAnnMap('GO2ORF', 'org.Sc.sgd')
>>> all.orfs <- keys(orf2go)
>>>
>>> ## Do GOstats test to get significant GO ontologies
>>> params <- new("GOHyperGParams", geneIds=orfs,
>>> universeGeneIds=all.orfs, annotation='org.Sc.sgd',
>>> ontology='BP', pvalueCutoff=0.05,
>>> conditional=FALSE, testDirection='over')
>>> GO <- hyperGTest(params)
>>>
>>> ## Look at the genes annotated in the universe for each GOID and compare
>>> ## with annotations for each gene in org.Sc.sgdGO
>>> universes <- geneIdUniverse(GO)
>>> uid <- names(universes)[1] ## GO:0007059
>>> uid.members <- universes[[uid]]
>>> "YKL049C" %in% uid.members ## TRUE
>>>
>>> ## Is the gene listed in GOID's (uid) universe annotated with that GOID?
>>> ## Given previous result, this *should* be TRUE
>>> go.ids <- orf2go[["YKL049C"]]
>>> uid %in% names(go.ids) ## FALSE
>>>
>>>
>>> ## Does that GOID have this gene as its member?
>>> ## This also should be TRUE
>>> check.orfs <- go2orf[[uid]]
>>> "YKL049C" %in% check.orfs ## FALSE
>>>
>>> ======================================
>>>
>>> Since `"YKL049C" %in% uid.members` is TRUE, I would expect the go term
>>> `uid` (GO:0007059) to be in both `names(go.ids)` and `check.orfs`, but
>>> it's not.
>>>
>>> Can someone shed some light on this for me?
>>>
>>> sessionInfo()
>>> R version 2.11.1 (2010-05-31)
>>> i386-apple-darwin9.8.0
>>>
>>> locale:
>>> [1] C
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] GO.db_2.4.1 annotate_1.26.1 org.Sc.sgd.db_2.4.1
>>> GOstats_2.14.0
>>> [5] RSQLite_0.9-2 DBI_0.2-5 graph_1.26.0
>>> Category_2.14.0
>>> [9] AnnotationDbi_1.10.2 Biobase_2.8.0 devtools_0.1
>>> stringr_0.4
>>> [13] roxygen_0.1-2 profr_0.1.1 digest_0.4.2
>>> testthat_0.3
>>>
>>> loaded via a namespace (and not attached):
>>> [1] GSEABase_1.10.0 RBGL_1.24.0 XML_3.1-1
>>> evaluate_0.3 genefilter_1.30.0
>>> [6] plyr_1.1 splines_2.11.1 survival_2.35-8
>>> tools_2.11.1 xtable_1.5-6
>>>
>>>
>>> Thanks,
>>> -steve
>>>
>>>
>>> --
>>> Steve Lianoglou
>>> Graduate Student: Computational Systems Biology
>>> | Memorial Sloan-Kettering Cancer Center
>>> | Weill Medical College of Cornell University
>>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>>
>
>
>
More information about the Bioconductor
mailing list