[BioC] GO's to gene's
Loren Engrav
engrav at u.washington.edu
Tue Mar 2 05:54:55 CET 2010
This is fun, at least for me.
But I am smiling: you want me to write a query? I can barely plagiarize your
commands.
OK, I can do this.
You could not find 9697 under GO:0032964 either, so be it.
You found 1471, so why did I have it as missing? My mistake, it was on the
previous printout page.
OK, so I tried your method for 3570 and 3091.
> names (org.Hs.egGO[["3570"]])
 [1] "GO:0002384" "GO:0002548" "GO:0002690" "GO:0006953" "GO:0050829" "GO:0042981" "GO:0031018"
 [8] "GO:0031018" "GO:0032722" "GO:0032755" "GO:0034097" "GO:0042517" "GO:0045669" "GO:0045768"
[15] "GO:0048661" "GO:0050731" "GO:0070102" "GO:0070120" "GO:0005886" "GO:0016021" "GO:0005576"
[22] "GO:0005896" "GO:0016324" "GO:0005102" "GO:0004872" "GO:0004897" "GO:0004915" "GO:0019899"
[29] "GO:0042803" "GO:0070119"
Nope, GO:0032966 is not there. I checked Amigo and it is there.
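Rather than scanning the printout by eye, the same check can be made with %in%. A small sketch in base R, assuming the IDs printed above for 3570 are pasted by hand into a vector (go_3570 is a made-up name; with the packages loaded, names(org.Hs.egGO[["3570"]]) would supply the IDs directly):

```r
# GO IDs printed above for Entrez gene 3570 (IL6R), copied by hand.
go_3570 <- c(
  "GO:0002384", "GO:0002548", "GO:0002690", "GO:0006953", "GO:0050829",
  "GO:0042981", "GO:0031018", "GO:0031018", "GO:0032722", "GO:0032755",
  "GO:0034097", "GO:0042517", "GO:0045669", "GO:0045768", "GO:0048661",
  "GO:0050731", "GO:0070102", "GO:0070120", "GO:0005886", "GO:0016021",
  "GO:0005576", "GO:0005896", "GO:0016324", "GO:0005102", "GO:0004872",
  "GO:0004897", "GO:0004915", "GO:0019899", "GO:0042803", "GO:0070119"
)

# Is the term Amigo reports for IL6R among them?
found_3570 <- "GO:0032966" %in% go_3570
found_3570   # FALSE
```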
> names (org.Hs.egGO[["3091"]])
 [1] "GO:0001666" "GO:0001755" "GO:0001837" "GO:0001892" "GO:0001938" "GO:0001947" "GO:0002248"
 [8] "GO:0007165" "GO:0006089" "GO:0006355" "GO:0006879" "GO:0010575" "GO:0010634" "GO:0014850"
[15] "GO:0042981" "GO:0046886" "GO:0030154" "GO:0030949" "GO:0032364" "GO:0032722" "GO:0032909"
[22] "GO:0032963" "GO:0035162" "GO:0042541" "GO:0042593" "GO:0042789" "GO:0043193" "GO:0043619"
[29] "GO:0045648" "GO:0045766" "GO:0045821" "GO:0045926" "GO:0045941" "GO:0045944" "GO:0046716"
[36] "GO:0050790" "GO:0051000" "GO:0051216" "GO:0051541" "GO:0005634" "GO:0005737" "GO:0005667"
[43] "GO:0005730" "GO:0009434" "GO:0003705" "GO:0004871" "GO:0008134" "GO:0051879" "GO:0035035"
[50] "GO:0043565" "GO:0046982" "GO:0046982"
Yup, GO:0032963 is there, so why was it missed? So
> org.Hs.egGO2EG[["GO:0032963"]]
IEA IMP
"3091" "7148"
And I trimmed IEA, which dropped 3091. But Amigo indicates the evidence is ISS.
So we have
Two not there (9697 and 3570)
One my mistake (1471) and
One (3091) that org.Hs.eg.db lists as IEA and Amigo as ISS.
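The IEA/ISS case can be reproduced in isolation. A minimal sketch in base R, assuming the GO:0032963 result shown above is typed in by hand as a named character vector (names are evidence codes, values are Entrez gene IDs) and filtered the same way as Martin's lapply() line; hits and kept are made-up names:

```r
# org.Hs.egGO2EG[["GO:0032963"]] as printed above, entered by hand.
hits <- c(IEA = "3091", IMP = "7148")

# Drop every entry whose evidence code is IEA.
kept <- hits[names(hits) != "IEA"]
kept                       # only the IMP entry, "7148", remains

# 3091 (HIF1A) is gone because this annotation release tags it IEA,
# even though Amigo currently shows ISS for the same pair.
has_3091 <- "3091" %in% kept
has_3091                   # FALSE
```

So the filter did what was asked; the disagreement is between the two annotation sources, not in the command.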
I suppose since this question can be answered quite easily with Amigo, and
they update the Amigo assocdb weekly, I should just stick with Amigo for
questions like this. But R/BioC is more fun. And once you have the commands
in the R.app history, rerunning the analysis is painless, sort of.
Again, thank you.
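
For the record, here is a rough sketch of the little function Martin suggests below, not a tested implementation. It makes some assumptions: the three lookup tables are passed in as plain R objects so that it runs without the annotation packages, the extra arguments terms, ontologies and go2eg are my own additions to his signature, and the term label for GO:0032963 does not appear in this thread and is filled in as an assumption. With GO.db and org.Hs.eg.db loaded, the tables would presumably come from Term(GOTERM), Ontology(GOTERM) and as.list(org.Hs.egGO2EG).

```r
# Sketch of a getGO() helper: GO terms matching a word, restricted to one
# ontology, mapped to genes, with unwanted evidence codes removed.
getGO <- function(termLike, ontology, excludeEvidence,
                  terms, ontologies, go2eg) {
    # Look ontologies up by GO ID so the two tables need not share an order.
    hit <- grepl(termLike, terms, fixed = TRUE) &
        (ontologies[names(terms)] == ontology)
    ids <- names(terms)[which(hit)]
    # Keep only GO IDs that have any gene mapping at all.
    ids <- intersect(ids, names(go2eg))
    # Drop genes carrying an excluded evidence code.
    genes <- lapply(go2eg[ids],
                    function(elt) elt[!names(elt) %in% excludeEvidence])
    # Discard GO terms left with no genes.
    genes[vapply(genes, length, integer(1)) > 0]
}

# Toy tables holding only a few entries seen in this thread.
terms <- c("GO:0010711" = "negative regulation of collagen catabolic process",
           "GO:0032963" = "collagen metabolic process",  # label assumed
           "GO:0005518" = "collagen binding")
ontologies <- c("GO:0010711" = "BP", "GO:0032963" = "BP", "GO:0005518" = "MF")
go2eg <- list("GO:0010711" = c(IEP = "1471"),
              "GO:0032963" = c(IEA = "3091", IMP = "7148"))

res <- getGO("collagen", "BP", "IEA", terms, ontologies, go2eg)
res
```

On these toy tables the result keeps 1471 under GO:0010711 and only 7148 under GO:0032963, which matches what I saw after trimming IEA.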
> From: Martin Morgan <mtmorgan at fhcrc.org>
> Date: Mon, 01 Mar 2010 19:49:05 -0800
> To: Loren Engrav <engrav at u.washington.edu>
> Cc: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
> Subject: Re: [BioC] GO's to gene's
>
> On 03/01/2010 06:34 PM, Loren Engrav wrote:
>> Thank you
>> You are clearly very good at this
>>
>> So to check it all out I did it manually on Amigo. Amigo found 33 genes
>> (limited to Human and omitting IEA)
>>
>> The org.Hs.eg.db method found 29 of the 33 but did not find
>> CST3 (1471) GO:0010711 IEP
>> HIF1A (3091) GO:0032963 ISS
>> IL6R (3570) GO:0032966 IDA and
>> TRAM2 (9697) GO:0032964 IMP
>>
>> I suppose figuring out, for example, why org.Hs.eg.db does not map 9697 to
>> GO:0032964 is complex
>
>> names(org.Hs.egGO[["9697"]])
> [1] "GO:0015031" "GO:0065002" "GO:0016020" "GO:0016021"
>
> Hmm, what are the offspring / ancestors of GO:0032964 ?
>
>> GOBPOFFSPRING[["GO:0032964"]]
> [1] "GO:0032965" "GO:0032966" "GO:0032967"
>> GOBPANCESTOR[["GO:0032964"]]
> [1] "all" "GO:0008152" "GO:0008150" "GO:0009058" "GO:0009059"
> [6] "GO:0032501" "GO:0032963" "GO:0043170" "GO:0044236" "GO:0044259"
>
> Nope nothing jumping out. Where's the GO data coming from?
>
>> org.Hs.eg() ## or GO()
> [snip]
> Date for GO data: 20090830
>
> Whereas AMIGO says (at the bottom of each page)
>
> GO database release 2010-02-27
>
> so that looks like a likely issue that would require some more
> substantial investigation. Merits of using a 'current' db (Amigo) vs a
> 'versioned' db (GO.db)? See mailing list archives, e.g., current
> state-of-knowledge vs. reproducibility (how would we redo the analysis
> we did last month and get the same results with AMIGO?).
>
> On the other hand
>
>> org.Hs.egGO2EG[["GO:0010711"]]
> IEP
> "1471"
>> GOTERM[["GO:0010711"]]
> GOID: GO:0010711
> Term: negative regulation of collagen catabolic process
> Ontology: BP
> Definition: Any process that decreases the rate, frequency or extent of
> collagen catabolism. Collagen catabolism is the proteolytic
> chemical reactions and pathways resulting in the breakdown of
> collagen in the extracellular matrix.
> Synonym: down regulation of collagen catabolic process
> Synonym: down-regulation of collagen catabolic process
> Synonym: downregulation of collagen catabolic process
> Synonym: inhibition of collagen catabolic process
> Synonym: negative regulation of collagen breakdown
> Synonym: negative regulation of collagen catabolism
> Synonym: negative regulation of collagen degradation
>
> so why didn't we find that one?
>
>> terms <- Term(GOTERM) # or maybe Definition(GOTERM)
>> "GO:0010711" %in% names(terms)
> [1] TRUE
>> terms[["GO:0010711"]]
> [1] "negative regulation of collagen catabolic process"
>
> yep it's there
>
>> ontologies <- Ontology(GOTERM)
>> ontologies[["GO:0010711"]]
> [1] "BP"
>> collagen <- terms[grepl("collagen", terms) & ("BP" == ontologies)]
>> collagen[["GO:0010711"]]
> [1] "negative regulation of collagen catabolic process"
>
> yep it's there (or were we looking for MF, as below?).
>
>> egids[["GO:0010711"]]
> IEP
> "1471"
>
> yep it's there. So this makes me think it's a programming error or a
> miscommunication. I'd suggest you write a little function
>
> getGO <-
> function(termLike, ontology, excludeEvidence)
> {
> ## a few lines of code here, representing the query you perform
> }
>
> and perhaps sharing that with the list will shed some light.
>
> Martin
>
>
>>
>> Thank you
>>
>>
>>> From: Martin Morgan <mtmorgan at fhcrc.org>
>>> Date: Mon, 01 Mar 2010 05:16:48 -0800
>>> To: Loren Engrav <engrav at u.washington.edu>
>>> Cc: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
>>> Subject: Re: [BioC] GO's to gene's
>>>
>>> On 02/28/2010 09:01 PM, Loren Engrav wrote:
>>>> So I checked
>>>>> collagen
>>>> And this list matches Amigo
>>>> So it would appear that the issue lies in
>>>>> egids <- mget(names(collagen), org.Hs.egGO2EG, ifnotfound=NA)
>>>> Some of the names find no associated genes in org.Hs.egGO2EG and so
>>>> appear as NA
>>>> True? Possible?
>>>
>>> yes. GO is not H. sapiens specific and ENTREZ ids are not 100%
>>> comprehensive, so some GO terms do not map to ENTREZ ids.
>>>
>>>>>> Also I would like to omit the IEA group
>>>
>>> maybe
>>>
>>> egids <- lapply(egids, function(elt) elt[names(elt) != "IEA"])
>>> egids[sapply(egids, length) != 0]
>>>
>>> Martin
>>>
>>>> My version of org.Hs.eg.db is 2.3.6
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>> From: Loren Engrav <engrav at u.washington.edu>
>>>>> Date: Sun, 28 Feb 2010 20:33:05 -0800
>>>>> To: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
>>>>> Conversation: [BioC] GO's to gene's
>>>>> Subject: Re: [BioC] GO's to gene's
>>>>>
>>>>> Oops, Amigo says there are 20 such terms, not 68 as I said before, because I
>>>>> retrieved only BP
>>>>>
>>>>>
>>>>>> From: Loren Engrav <engrav at u.washington.edu>
>>>>>> Date: Sun, 28 Feb 2010 20:28:17 -0800
>>>>>> To: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
>>>>>> Conversation: [BioC] GO's to gene's
>>>>>> Subject: Re: [BioC] GO's to gene's
>>>>>>
>>>>>> Ok thank you
>>>>>> I now show
>>>>>>> sessionInfo()
>>>>>> R version 2.10.1 (2009-12-14)
>>>>>> i386-apple-darwin9.8.0
>>>>>>
>>>>>> locale:
>>>>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] stats graphics grDevices utils datasets methods base
>>>>>>
>>>>>> other attached packages:
>>>>>> [1] org.Hs.eg.db_2.3.6  GO.db_2.3.5         RSQLite_0.8-3       AnnotationDbi_1.8.1 DBI_0.2-5
>>>>>> [6] Biobase_2.6.1
>>>>>>
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] tools_2.10.1
>>>>>>
>>>>>> And all commands pass with no errors, however I see
>>>>>>
>>>>>>> egids
>>>>>> $`GO:0010711`
>>>>>> IEP
>>>>>> "1471"
>>>>>>
>>>>>> $`GO:0030199`
>>>>>>     IEA     IEA     ISS     IEA     IMP     IMP     IMP     IMP     NAS     IMP     NAS     IMP     ISS
>>>>>>   "302"   "304"   "538"   "871"  "1277"  "1278"  "1280"  "1281"  "1281"  "1289"  "1289"  "1290"  "1290"
>>>>>>     NAS     IDA     NAS     IEA     IEA     IEA     IEA     IEA     NAS     ISS     IDA     ISS     NAS
>>>>>>  "1301"  "1302"  "1303"  "1805"  "2296"  "2303"  "4010"  "4015"  "4060"  "4763"  "7042"  "7046"  "7373"
>>>>>>     NAS     NAS
>>>>>>  "9508" "50509"
>>>>>>
>>>>>> $`GO:0030574`
>>>>>>      IEA      IEA      IEA      IEA      IEA      IEA      IEA      IEA      IEA      IEA      IEA
>>>>>>   "4312"   "4313"   "4314"   "4316"   "4317"   "4318"   "4319"   "4320"   "4322"   "4325"   "4327"
>>>>>>      IEA      IDA      IMP      NAS      IEA      NAS      IEA      IEA      IEA      IEA
>>>>>>   "5184"   "5645"   "5645"   "5653"   "5657"   "9508"   "9509"  "56547"  "64066" "140766"
>>>>>>
>>>>>> $`GO:0032963`
>>>>>> IEA IMP
>>>>>> "3091" "7148"
>>>>>>
>>>>>> $`GO:0032964`
>>>>>> IEA IMP IMP TAS IMP
>>>>>> "871" "1277" "1281" "1281" "1289"
>>>>>>
>>>>>> $`GO:0032966`
>>>>>> IDA IC
>>>>>> "3569" "4261"
>>>>>>
>>>>>> $`GO:0032967`
>>>>>> ISS IDA IDA IC IMP TAS IMP
>>>>>> "265" "2147" "2149" "3066" "7040" "7040" "7043"
>>>>>>
>>>>>> $`GO:0033342`
>>>>>> IMP
>>>>>> "23560"
>>>>>>
>>>>>> So many GO terms containing the word "collagen" are not listed, like
>>>>>> 0004656
>>>>>> 0005518
>>>>>> etc
>>>>>> Amigo claims there are 68 such terms and the list above has only 8
>>>>>> What did I do wrong?
>>>>>> Also I would like to omit the IEA group
>>>>>>
>>>>>> Thank you
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> From: Martin Morgan <mtmorgan at fhcrc.org>
>>>>>>> Date: Sun, 28 Feb 2010 19:30:34 -0800
>>>>>>> To: Loren Engrav <engrav at u.washington.edu>
>>>>>>> Cc: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
>>>>>>> Subject: Re: [BioC] GO's to gene's
>>>>>>>
>>>>>>> On 02/28/2010 07:17 PM, Loren Engrav wrote:
>>>>>>>> Thank you both
>>>>>>>> Given my skills, it might be easier/quicker to do it "manually" with Amigo
>>>>>>>> But I am trying both methods
>>>>>>>>
>>>>>>>> For the second method I get
>>>>>>>>
>>>>>>>>> library(GO.db)
>>>>>>>> Loading required package: AnnotationDbi
>>>>>>>> Loading required package: Biobase
>>>>>>>>
>>>>>>>> Welcome to Bioconductor
>>>>>>>>
>>>>>>>> Vignettes contain introductory material. To view, type
>>>>>>>> 'openVignette()'. To cite Bioconductor, see
>>>>>>>> 'citation("Biobase")' and for packages 'citation(pkgname)'.
>>>>>>>>
>>>>>>>> Loading required package: DBI
>>>>>>>>> terms <- Term(GOTERM)
>>>>>>>> Error in function (classes, fdef, mtable) :
>>>>>>>> unable to find an inherited method for function "Term", for signature
>>>>>>>> "GOTermsAnnDbBimap"
>>>>>>>>
>>>>>>>>> sessionInfo()
>>>>>>>> R version 2.9.2 Patched (2009-09-05 r49613)
>>>>>>>> i386-apple-darwin9.8.0
>>>>>>>>
>>>>>>>> locale:
>>>>>>>> en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>>>>>>
>>>>>>>> attached base packages:
>>>>>>>> [1] stats graphics grDevices utils datasets methods base
>>>>>>>
>>>>>>> Update to R version 2.10 and associated Bioc packages, or for a (much)
>>>>>>> slower solution (you'll want to check that Term and Ontology return ids
>>>>>>> in identical order)
>>>>>>>
>>>>>>> terms = eapply(GOTERM, Term)
>>>>>>>
>>>>>>> etc. I have
>>>>>>>
>>>>>>>> sessionInfo()
>>>>>>> R version 2.10.1 Patched (2010-02-23 r51168)
>>>>>>> x86_64-unknown-linux-gnu
>>>>>>>
>>>>>>> locale:
>>>>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>>>>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>>>>>>> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
>>>>>>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>>>
>>>>>>> attached base packages:
>>>>>>> [1] stats graphics grDevices utils datasets methods base
>>>>>>>
>>>>>>> other attached packages:
>>>>>>> [1] GO.db_2.3.5 RSQLite_0.7-3 DBI_0.2-4
>>>>>>> [4] AnnotationDbi_1.8.1 Biobase_2.6.1
>>>>>>>
>>>>>>> loaded via a namespace (and not attached):
>>>>>>> [1] tools_2.10.1
>>>>>>>
>>>>>>>
>>>>>>> Martin
>>>>>>>
>>>>>>>>
>>>>>>>>> From: Martin Morgan <mtmorgan at fhcrc.org>
>>>>>>>>> Date: Sun, 28 Feb 2010 18:42:33 -0800
>>>>>>>>> To: Vincent Carey <stvjc at channing.harvard.edu>
>>>>>>>>> Cc: Loren Engrav <engrav at u.washington.edu>,
>>>>>>>>> "bioconductor at stat.math.ethz.ch"
>>>>>>>>> <bioconductor at stat.math.ethz.ch>
>>>>>>>>> Subject: Re: [BioC] GO's to gene's
>>>>>>>>>
>>>>>>>>> On 02/28/2010 06:14 PM, Vincent Carey wrote:
>>>>>>>>>> Perhaps there is a package with such functionality. However, with the
>>>>>>>>>> GO.db package in place, you need to do a little programming, perhaps
>>>>>>>>>> along the lines of
>>>>>>>>>>
>>>>>>>>>> querGO = function(str, attr = "definition", ont = "MF") {
>>>>>>>>>>     require(GO.db, quietly = TRUE)
>>>>>>>>>>     gc = GO_dbconn()
>>>>>>>>>>     quer.1 = paste("select go_id, term from go_term where", attr, "like('%")
>>>>>>>>>>     quer.2 = "%') and ontology = '"
>>>>>>>>>>     quer.3 = "'"
>>>>>>>>>>     quer = paste(quer.1, str, quer.2, ont, quer.3, collapse = "", sep = "")
>>>>>>>>>>     dbGetQuery(gc, quer)
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> whereby
>>>>>>>>>>
>>>>>>>>>>> querGO("collagen", "term")
>>>>>>>>>>        go_id                                                           term
>>>>>>>>>> 1 GO:0004656                     procollagen-proline 4-dioxygenase activity
>>>>>>>>>> 2 GO:0005518                                               collagen binding
>>>>>>>>>> 3 GO:0008475                      procollagen-lysine 5-dioxygenase activity
>>>>>>>>>> 4 GO:0019797                     procollagen-proline 3-dioxygenase activity
>>>>>>>>>> 5 GO:0019798                       procollagen-proline dioxygenase activity
>>>>>>>>>> 6 GO:0033823                       procollagen glucosyltransferase activity
>>>>>>>>>> 7 GO:0042329 structural constituent of collagen and cuticulin-based cuticle
>>>>>>>>>> 8 GO:0050211                     procollagen galactosyltransferase activity
>>>>>>>>>> 9 GO:0070052                                             collagen V binding
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Also
>>>>>>>>>
>>>>>>>>> library(GO.db)
>>>>>>>>> terms <- Term(GOTERM) # or maybe Definition(GOTERM) ?
>>>>>>>>> ontologies <- Ontology(GOTERM)
>>>>>>>>> collagen <- terms[grepl("collagen", terms) & ("MF" == ontologies)]
>>>>>>>>>
>>>>>>>>> and the next step,
>>>>>>>>>
>>>>>>>>> library(org.Hs.eg.db)
>>>>>>>>> egids <- mget(names(collagen), org.Hs.egGO2EG, ifnotfound=NA)
>>>>>>>>> egids <- egids[!is.na(egids)]
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sun, Feb 28, 2010 at 8:56 PM, Loren Engrav <engrav at u.washington.edu> wrote:
>>>>>>>>>>> Is there a BioC package that will find all the GO terms containing
>>>>>>>>>>> some word, like perhaps "collagen"
>>>>>>>>>>> And then find all the genes contained within those found terms
>>>>>>>>>>>
>>>>>>>>>>> I scanned
>>>>>>>>>>> goProfiles
>>>>>>>>>>> GOSemSim
>>>>>>>>>>> GOstats
>>>>>>>>>>> goTools and
>>>>>>>>>>> topGO
>>>>>>>>>>>
>>>>>>>>>>> And could not determine that any would do that.
>>>>>>>>>>>
>>>>>>>>>>> Thank you.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> Bioconductor mailing list
>>>>>>>>>>> Bioconductor at stat.math.ethz.ch
>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>>>>> Search the archives:
>>>>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Martin Morgan
>>>>>>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>>>>>>> 1100 Fairview Ave. N.
>>>>>>>>> PO Box 19024 Seattle, WA 98109
>>>>>>>>>
>>>>>>>>> Location: Arnold Building M1 B861
>>>>>>>>> Phone: (206) 667-2793