[BioC] mismatch in GO terms between topGO_1.14.0 and org.Mm.eg.db_2.3.6

Wed Mar 3 17:18:31 CET 2010

Hi Adrian,

Thanks very much for your reply.  Your example for building the topGO object was very helpful.

Another question:  Do you have a favorite way to summarize the topGO output?  What I am trying to do is something like CateGOrizer (http://www.animalgenome.org/bioinfo/tools/catego/),
which uses higher-level GO terms to give a summary overview of the enriched GO terms.

Thanks very much,
Dick
*******************************************************************************
Richard P. Beyer, Ph.D.	University of Washington
Tel.:(206) 616 7378	Env. & Occ. Health Sci. , Box 354695
Fax: (206) 685 4696	4225 Roosevelt Way NE, # 100
 			Seattle, WA 98105-6099
http://depts.washington.edu/ceeh/ServiceCores/FC5/FC5.html
http://staff.washington.edu/~dbeyer
*******************************************************************************
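
One possible way to approach the CateGOrizer-style summary asked about above is to count, for each chosen higher-level term, how many enriched terms are that term or one of its descendants. The sketch below is only an illustration in base R: the function name summarize_by_slim, the term IDs, and the toy ancestor table are all invented here. With real data the ancestor table could presumably be taken from the GO.db package (for example as.list(GOBPANCESTOR)), which is an assumption and is not run in this sketch.

```r
# Toy ancestor table: each term maps to ALL of its ancestors.
# The IDs are invented for illustration; they are not real GO terms.
ancestors <- list(
  "GO:X011" = c("GO:X001", "GO:X000"),
  "GO:X012" = c("GO:X001", "GO:X000"),
  "GO:X021" = c("GO:X002", "GO:X000"),
  "GO:X001" = "GO:X000",
  "GO:X002" = "GO:X000"
)

# Hypothetical helper: count how many enriched terms fall under each
# higher-level ("slim") term. A term also counts toward itself.
summarize_by_slim <- function(enriched, ancestors, slim) {
  hits <- lapply(enriched, function(id) {
    intersect(c(id, ancestors[[id]]), slim)
  })
  counts <- table(factor(unlist(hits), levels = slim))
  setNames(as.integer(counts), names(counts))
}

slim     <- c("GO:X001", "GO:X002")            # chosen summary terms
enriched <- c("GO:X011", "GO:X012", "GO:X021") # e.g. significant GO.IDs
slim_counts <- summarize_by_slim(enriched, ancestors, slim)
print(slim_counts)
```

In practice the enriched vector would be the GO.ID column of the topGO result table, filtered at whatever significance cutoff is being used, and the slim vector would be a hand-picked or published set of high-level terms.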

On Wed, 3 Mar 2010, Adrian Alexa wrote:

> Hi Dick,
>
> as Sean already mentioned, org.Mm.egGO2EG contains only the most
> specific GO annotations. topGO doesn't care whether you supply the most
> specific gene-to-GO mappings or the complete mappings. You will obtain
> the same result if you use either org.Mm.egGO2EG or
> org.Mm.egGO2ALLEGS. However, due to the redundancies in the
> org.Mm.egGO2ALLEGS mappings, I advise using the most specific
> mappings.
>
> Also, since you are using a Bioconductor annotation package, you don't
> need to construct the gene2GO list to provide the annotations. There
> is a function, namely "annFUN.org", which is more convenient to use
> when building the "topGOdata" object. In this case the instantiation
> of a topGOdata object will look like:
>
> GOdata <- new("topGOdata",
>                   ontology = "BP",
>                   allGenes = geneList,
>                   nodeSize = 5,
>                   annot = annFUN.org,
>                   mapping = "org.Mm.eg.db",
>                   ID = "entrez")
>
> The "mapping" argument specifies which annotation package to use and the
> "ID" argument selects which type of gene identifier to use.
>
>
> You can also use functions from topGO to access the genes annotated to
> a GO term of interest.
>
> # all the genes annotated to GO:0030522 -- NOT only the most specific ones!
> myGenes <- genesInTerm(GOdata, "GO:0030522")
>
> # the number of annotated genes
> no.myGenes <- countGenesInTerm(GOdata, "GO:0030522")
>
>
> Hope this helps. Let me know if you have additional questions.
>
>
> Regards,
> Adrian
>
> On Wed, Mar 3, 2010 at 7:32 AM, Dick Beyer <dbeyer at u.washington.edu> wrote:
>> Hi Sean,
>>
>> Thanks very much for looking into this.  I guess I need to think about this.
>> What is confusing to me is that topGO takes a gene2GO list as input (a list of
>> GO terms for each gene), which I generated from org.Mm.egGO2EG (no
>> GO:0030522, for example). Getting GOIDs out of topGO that are in
>> org.Mm.egGO2ALLEGS rather than org.Mm.egGO2EG makes me think I should build
>> my gene2GO input list from org.Mm.egGO2ALLEGS rather than org.Mm.egGO2EG.
>>
>> I also didn't dig far enough when I checked GO:0030522 at geneontology.org,
>> which showed 34 gene products for Mus musculus.  However, had I looked
>> further I would have seen GO:0030522 has no genes of its own.
>>
>> Until recently, I used ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz for
>> getting Entrez Gene ID/GOID mappings, but switched to the Bioconductor
>> org.Mm.eg.db way as it is much simpler.
>>
>> Thanks for the good education!
>>
>> Cheers,
>> Dick
>>
>> On Tue, 2 Mar 2010, Sean Davis wrote:
>>
>>> On Tue, Mar 2, 2010 at 7:15 PM, Dick Beyer <dbeyer at u.washington.edu>
>>> wrote:
>>>>
>>>> Hello,
>>>>
>>>> I've been running topGO (using mouse Entrez Gene IDs) and found that some
>>>> GO terms that turn up in the topGO analysis are not in the GO terms from
>>>> org.Mm.eg.db.
>>>>
>>>> I'd like to give some example code to show how to generate the problem,
>>>> but my topGO code is a lot of lines.  The output looks like:
>>>>
>>>> allResults[[1]][[1]][1:2,]
>>>>          GO.ID                                Term Annotated Significant Expected classic    elim weight
>>>> 714 GO:0019222     regulation of metabolic process      2498         143   107.08 0.00010 0.17956 0.9057
>>>> 762 GO:0006807 nitrogen compound metabolic process      3413         186   146.31 0.00011 0.45337 0.9434
>>>>
>>>> So, the topGO output gives a column of GOIDs and such.
>>>>
>>>> Some of the problem GOIDs from topGO are GO:0030522, GO:0051094,
>>>> GO:0031497, GO:0046700.
>>>>
>>>> I can't find these in names(Mm.egGO2EG).
>>>>
>>>> library("org.Mm.eg.db")
>>>> Mm.egGO2EG <- as.list(org.Mm.egGO2EG)
>>>> grep("GO:0030522",names(Mm.egGO2EG))
>>>> integer(0)
>>>>
>>>> Is it possible that topGO depends on GO.db, and I'm using org.Mm.eg.db?
>>>>  When I check for GO:0030522 for Mus musculus at geneontology.org,
>>>> GO:0030522 is valid.
>>>>
>>>> I'm puzzled by the mismatch.  I want to get the genes for a given GOID,
>>>> so there is probably a workaround.  If anyone has a suggestion or idea, I'd
>>>> be very grateful to know what to try.
>>>>
>>>
>>> Hi, Dick.
>>>
>>> The Gene Ontology, as I'm sure everyone knows, is hierarchical.  The
>>> org.Mm.egGO2EG table stores ONLY the most specific term for each gene.
>>> However, the org.Mm.egGO2ALLEGS stores the term and all the genes for
>>> itself AND its children.  Most of the gene ontology analysis
>>> algorithms use the latter definition; it looks like topGO does also.
>>> In short, try this:
>>>
>>> get('GO:0030522',org.Mm.egGO2ALLEGS)
>>>    IDA      IMP      IDA      IGI      IMP      IGI      IMP      IMP
>>> "11835"  "11835"  "11848"  "12034"  "12034"  "13082"  "13123"  "13983"
>>>    IMP      ISO      IMP      IDA      IMP      IMP      IMP      ISO
>>> "14228"  "14599"  "14602"  "14815"  "14815"  "15502"  "16000"  "16000"
>>>    IDA      IDA      IMP      IDA      IGI      IMP      IMP      IDA
>>> "16601"  "18667"  "18854"  "19213"  "19378"  "19378"  "19411"  "20181"
>>>    IDA      IDA      IMP      IMP      IMP      IPI      IDA      IGI
>>> "20182"  "20183"  "20779"  "21815"  "21848"  "22215"  "24074"  "27401"
>>>    IMP      ISA      IDA      IDA      IMP      IDA
>>> "56351"  "56847"  "59035"  "67488" "224903" "232174"
>>>
>>> Hope that helps clear things up.
>>>
>>> Sean
>>>
>>
>
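
The distinction discussed in this thread, between "most specific" annotations (as in org.Mm.egGO2EG) and annotations propagated up the ontology (as in org.Mm.egGO2ALLEGS), can be illustrated with a toy example in base R. Everything below is invented for illustration: the term IDs, the gene names, and the helper functions all_ancestors and propagate. It is a sketch of the idea only, not how topGO or the annotation packages are implemented. Unlike the real GO2ALLEGS mapping, which keeps evidence codes and can list a gene more than once, this toy version de-duplicates genes.

```r
# Toy ontology: each term maps to its direct parents (invented IDs).
parents <- list(
  "GO:leaf1" = "GO:mid",
  "GO:leaf2" = "GO:mid",
  "GO:mid"   = "GO:root",
  "GO:root"  = character(0)
)

# Direct ("most specific") annotations: term -> genes.
# Note that "GO:mid" has no genes of its own.
direct <- list(
  "GO:leaf1" = c("g1", "g2"),
  "GO:leaf2" = c("g2", "g3")
)

# Collect all ancestors of a term by walking up the parent links.
all_ancestors <- function(term, parents) {
  found <- character(0)
  queue <- parents[[term]]
  while (length(queue) > 0) {
    current <- queue[1]
    queue <- queue[-1]
    if (!(current %in% found)) {
      found <- c(found, current)
      queue <- c(queue, parents[[current]])
    }
  }
  found
}

# Propagate every direct annotation to the term itself and its ancestors.
propagate <- function(direct, parents) {
  all_terms <- names(parents)
  out <- setNames(vector("list", length(all_terms)), all_terms)
  for (term in names(direct)) {
    for (target in c(term, all_ancestors(term, parents))) {
      out[[target]] <- union(out[[target]], direct[[term]])
    }
  }
  out
}

propagated <- propagate(direct, parents)

"GO:mid" %in% names(direct)   # absent from the direct mapping
propagated[["GO:mid"]]        # but it inherits genes from its children
```

This mirrors the GO:0030522 case above: a term with no genes of its own is missing from the most specific mapping, yet still shows up with genes once annotations are propagated from its descendants.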