[BioC] GO with R and Bioconductor
Vincent Carey
stvjc at channing.harvard.edu
Sun Feb 27 02:09:05 CET 2011
On Sat, Feb 26, 2011 at 6:40 PM, Duke <duke.lists at gmx.com> wrote:
> Hi Vincent,
>
> On 2/26/11 2:38 PM, Vincent Carey wrote:
>>
>> On Sat, Feb 26, 2011 at 12:00 PM,<duke.lists at gmx.com> wrote:
>>>
>>> Dear colleagues,
>>>
>>> I used to download GO database from geneoncology.org and did some c++
>>> coding to manipulate the data as I wished. Now I want to try my luck with R
>>> - Bioconductor. I have heard of tons of tools supporting GO such as GO.db,
>>> topGO, goseq, GOstats, biomart etc... and I have been reading their
>>> description and examples, but honestly I am overhelmed and dont really know
>>> which package I should use to fulfill my task. So please advise me how I can
>>> do the following two simple tasks:
>>>
>>> 1. I have a list of genes (with gene names from UCSC such as Foxp3
>>> etc...). How do I filter this list to get genes that have certain GO term
>>> such as transcription factor?
>>
>> since you said it was a simple task, consider the simple solution
>> involving the "%annTo%" operator, which tells whether the symbols on
>> the left have been annotated to the term on the right:
>>
>>> c("FOXP3", "BRCA2") %annTo% "mammary"
>>
>> FOXP3 BRCA2
>> FALSE TRUE
>>>
>>> c("FOXP3", "BRCA2") %annTo% "transcription factor"
>>
>> FOXP3 BRCA2
>> TRUE FALSE
>>
>> you could use the named logical vectors generated in this way to
>> perform the filtering you describe. but see below.
>>
>>> 2. How do I know the capacity of the latest GO database on bioconductor,
>>> for example, how many genes available for mm9, and how many of them have GO
>>> term transcription factor?
>>
>> The "GO database" concerns the gene ontology, a structure of terms and
>> relationships among them. The association of GO terms to gene names
>> for mouse is presented in various ways, but the most basic one is in
>> the org.Mm.eg.db package. With that, you could use
>>
>> org.Mm.eg()
>>
>> to find, among other statistics,
>>
>> org.Mm.egGO has 29984 mapped keys (of 63329 keys). Your question
>> concerning transcription factor mapping is not completely precise, and
>> you might want to survey the family of GO terms to come up with a set
>> of terms that meets your requirement.
>
> You are right, I did not make it clear. What I wanted was exactly what you
> said: I need to get a list of genes that have a certain set of terms (for
> example *transcription*) - which corresponds to a family of GO terms, not
> just one GO term.
>
>> Here's a
>> demonstration of related queries:
>>
>>> get("GO:0003700", GOTERM)
>>
>> GOID: GO:0003700
>> Term: sequence-specific DNA binding transcription factor activity
>> Ontology: MF
>> Definition: Interacting selectively and non-covalently with a specific
>> DNA sequence in order to modulate transcription. The transcription
>> factor may or may not also interact selectively with a protein or
>> macromolecular complex.
>> Synonym: GO:0000130
>> Secondary: GO:0000130
>>>
>>> tfg = get("GO:0003700", revmap(org.Mm.egGO))
>>> length(tfg)
>>
>> [1] 940
>
> This works fine in case of one GO term (GO: 0003700). Is there a similar
> function like getterm("transcription", revmap()) to get all the genes that
> their GO terms contain *transcription*?
As far as I know there is no such function. Solutions depend on the
type of wild-card you are looking for. If SQL LIKE operator is
adequate
termlikeToTags = function (liketok)
{
require(GO.db)
gcon = GO_dbconn()
as.character(dbGetQuery(gcon, paste("select go_id from go_term
where term like '%",
liketok, "%'", sep = ""))[[1]])
}
if you want to use full regular expressions
reToTags = function (retok, ...) {
require(GO.db)
allt = dbGetQuery(GO_dbconn(), "select go_id, term from go_term")
inds = grep(retok, as.character(allt[,"term"]), value=FALSE, ...)
# will error if value set at call
if (length(inds)>0) return(as.character(allt[inds,"go_id"]))
stop("retok did not grep to any term")
}
> length(reToTags("transcription"))
[1] 372
> length(termlikeToTags("transcription"))
[1] 372
> length(reToTags("transcrip.*"))
[1] 411
> length(termlikeToTags("transcrip.*"))
[1] 0
>
>> org.Mm.egGO is a mapping from mouse entrez gene ids to GO term tags.
>> revmap reverses this mapping and takes a tag to the set of genes
>> mapped to the tag by entrez.
>>
>> Now, to return to the first question -- it isn't simple and a lot of
>> presuppositions have to be made explicit. One of the most problematic
>> is the commitment to use gene symbols. If you don't read the docs
>> about bioconductor annotation and R packages pertaining thereto, it's
>> hard to make progress.
>
> Can you elaborate a little bit more why using gene symbols is problematic?
> And if it is really a problem, what gene ID I should use to avoid that
> problem?
Entrez gene IDs are generally more specific. If you haven't run into
collisions and ambiguities with symbols by now, maybe you don't have
to worry about it.
>
>> My code defining "%annTo%" will follow -- it
>> is not particularly efficient or well-designed, but it shows the
>> components that have to be identified and used for this to work. All
>> the annotation is based on SQLite tables distributed with Bioconductor
>> packages, with interfaces defined in DBI and RSQLite packages. The
>> %annTo% operator is defined below. More thoughtful solutions are
>> invited.
>>
>> sym2terms = function (sym)
>> {
>> if (length(sym)>1) stop("scalar symbol input required") # bad form
>> require(GO.db)
>> require(org.Hs.eg.db) # generalize
>> egs = sapply(sym, function(z) get(z, revmap(org.Hs.egSYMBOL)))
>> gos = mget(egs, org.Hs.egGO)
>> goids = sapply(gos, lapply, "[[", "GOID")
>> sapply(lapply(goids, function(z) get(z, GOTERM)), "Term")
>> }
>>
>> genesAnnotatedTo = function(syms, term="transcription factor",
>> meth=agrep) {
>> tms = lapply(syms, sym2terms)
>> chk = sapply(tms, function(z) length(agrep(term, z))>0)
>> names(chk) = syms
>> chk
>> }
>>
>> "%annTo%" = function(x,y) genesAnnotatedTo(x,y)
>>
>
> Thanks for this %annTo% function. It works perfectly!
>
> Regards,
>
> D.
>
More information about the Bioconductor
mailing list