[BioC] GO with R and Bioconductor
Duke
duke.lists at gmx.com
Sun Feb 27 00:40:28 CET 2011
Hi Vincent,
On 2/26/11 2:38 PM, Vincent Carey wrote:
> On Sat, Feb 26, 2011 at 12:00 PM,<duke.lists at gmx.com> wrote:
>> Dear colleagues,
>>
>> I used to download GO database from geneoncology.org and did some c++ coding to manipulate the data as I wished. Now I want to try my luck with R - Bioconductor. I have heard of tons of tools supporting GO such as GO.db, topGO, goseq, GOstats, biomart etc... and I have been reading their description and examples, but honestly I am overhelmed and dont really know which package I should use to fulfill my task. So please advise me how I can do the following two simple tasks:
>>
>> 1. I have a list of genes (with gene names from UCSC such as Foxp3 etc...). How do I filter this list to get genes that have certain GO term such as transcription factor?
> since you said it was a simple task, consider the simple solution
> involving the "%annTo%" operator, which tells whether the symbols on
> the left have been annotated to the term on the right:
>
>> c("FOXP3", "BRCA2") %annTo% "mammary"
> FOXP3 BRCA2
> FALSE TRUE
>> c("FOXP3", "BRCA2") %annTo% "transcription factor"
> FOXP3 BRCA2
> TRUE FALSE
>
> you could use the named logical vectors generated in this way to
> perform the filtering you describe. but see below.
>
>> 2. How do I know the capacity of the latest GO database on bioconductor, for example, how many genes available for mm9, and how many of them have GO term transcription factor?
> The "GO database" concerns the gene ontology, a structure of terms and
> relationships among them. The association of GO terms to gene names
> for mouse is presented in various ways, but the most basic one is in
> the org.Mm.eg.db package. With that, you could use
>
> org.Mm.eg()
>
> to find, among other statistics,
>
> org.Mm.egGO has 29984 mapped keys (of 63329 keys). Your question
> concerning transcription factor mapping is not completely precise, and
> you might want to survey the family of GO terms to come up with a set
> of terms that meets your requirement.
You are right, I did not make it clear. What I wanted was exactly what
you said: I need to get a list of genes that have a certain set of terms
(for example *transcription*) - which corresponds to a family of GO
terms, not just one GO term.
> Here's a
> demonstration of related queries:
>
>> get("GO:0003700", GOTERM)
> GOID: GO:0003700
> Term: sequence-specific DNA binding transcription factor activity
> Ontology: MF
> Definition: Interacting selectively and non-covalently with a specific
> DNA sequence in order to modulate transcription. The transcription
> factor may or may not also interact selectively with a protein or
> macromolecular complex.
> Synonym: GO:0000130
> Secondary: GO:0000130
>> tfg = get("GO:0003700", revmap(org.Mm.egGO))
>> length(tfg)
> [1] 940
This works fine in case of one GO term (GO: 0003700). Is there a similar
function like getterm("transcription", revmap()) to get all the genes
that their GO terms contain *transcription*?
> org.Mm.egGO is a mapping from mouse entrez gene ids to GO term tags.
> revmap reverses this mapping and takes a tag to the set of genes
> mapped to the tag by entrez.
>
> Now, to return to the first question -- it isn't simple and a lot of
> presuppositions have to be made explicit. One of the most problematic
> is the commitment to use gene symbols. If you don't read the docs
> about bioconductor annotation and R packages pertaining thereto, it's
> hard to make progress.
Can you elaborate a little bit more why using gene symbols is
problematic? And if it is really a problem, what gene ID I should use to
avoid that problem?
> My code defining "%annTo%" will follow -- it
> is not particularly efficient or well-designed, but it shows the
> components that have to be identified and used for this to work. All
> the annotation is based on SQLite tables distributed with Bioconductor
> packages, with interfaces defined in DBI and RSQLite packages. The
> %annTo% operator is defined below. More thoughtful solutions are
> invited.
>
> sym2terms = function (sym)
> {
> if (length(sym)>1) stop("scalar symbol input required") # bad form
> require(GO.db)
> require(org.Hs.eg.db) # generalize
> egs = sapply(sym, function(z) get(z, revmap(org.Hs.egSYMBOL)))
> gos = mget(egs, org.Hs.egGO)
> goids = sapply(gos, lapply, "[[", "GOID")
> sapply(lapply(goids, function(z) get(z, GOTERM)), "Term")
> }
>
> genesAnnotatedTo = function(syms, term="transcription factor",
> meth=agrep) {
> tms = lapply(syms, sym2terms)
> chk = sapply(tms, function(z) length(agrep(term, z))>0)
> names(chk) = syms
> chk
> }
>
> "%annTo%" = function(x,y) genesAnnotatedTo(x,y)
>
Thanks for this %annTo% function. It works perfectly!
Regards,
D.
More information about the Bioconductor
mailing list