[BioC] GO classification
Martin Morgan
mtmorgan at fhcrc.org
Fri Oct 12 22:32:30 CEST 2007
Hi Thomas --
I like the idea of incorporating slims into an existing package (and
have started to in the development version of GSEABase, version 1.1.5,
when it appears). I've been playing around a bit and wanted to get
some feedback from you and the list.
There's an immediately usable version of the underlying code at the
end of the message (no devel version of GSEABase required). The code
below is considerably faster than that posted earlier.
In GSEABase, the function getOBOCollection parses 'obo' files, as
these seem to be how 'slim' sets are sometimes stored. With GSEABase,
you'd
> fl <- "http://www.geneontology.org/GO_slims/goslim_plant.obo"
> oboSlim <- getOBOCollection(fl)
> oboSlim
collectionType: GO
ids: GO:0000003, GO:0000166, ..., GO:0045182 (106 total)
evidenceCode: IMP IPI TAS ISS IDA NAS IEA IGI RCA IEP IC NR ND
'fl' could be a local file. You could also create an ad hoc collection
with
> slimTerms <- GOCollection(c("GO:0030246", "GO:0008289", "GO:0003676",
+ "GO:0000166"))
Use the collection of slim ids to classify your own ids,
e.g.,
> my_GOIds <- c("GO:0016564", "GO:0003677", "GO:0004345", "GO:0004345",
+ "GO:0004345", "GO:0004345", "GO:0004345", "GO:0008265",
+ "GO:0003841", "GO:0030151", "GO:0006355", "GO:0009664",
+ "GO:0006412", "GO:0006412", "GO:0006412", "GO:0007046",
+ "GO:0015979", "GO:0006457", "GO:0008372", "GO:0005618",
+ "GO:0005622", "GO:0005840", "GO:0015935", "GO:0000311",
+ "GO:0005622", "GO:0009282")
> goSlim(GOCollection(my_GOIds), oboSlim, "MF")
Count Percent Term
GO:0000166 0 0.000000 nucleotide binding
GO:0003674 6 40.000000 molecular_function
GO:0003676 1 6.666667 nucleic acid binding
GO:0003677 1 6.666667 DNA binding
GO:0003682 0 0.000000 chromatin binding
GO:0003700 0 0.000000 transcription factor activity
GO:0003723 0 0.000000 RNA binding
GO:0003774 0 0.000000 motor activity
GO:0003824 3 20.000000 catalytic activity
GO:0004518 0 0.000000 nuclease activity
GO:0004871 0 0.000000 signal transducer activity
GO:0004872 0 0.000000 receptor activity
GO:0005102 0 0.000000 receptor binding
GO:0005198 0 0.000000 structural molecule activity
GO:0005215 0 0.000000 transporter activity
GO:0005488 2 13.333333 binding
GO:0005515 0 0.000000 protein binding
GO:0008135 0 0.000000 translation factor activity, nuclei...
GO:0008289 0 0.000000 lipid binding
GO:0016301 0 0.000000 kinase activity
GO:0016740 1 6.666667 transferase activity
GO:0016787 0 0.000000 hydrolase activity
GO:0019825 0 0.000000 oxygen binding
GO:0030234 0 0.000000 enzyme regulator activity
GO:0030246 0 0.000000 carbohydrate binding
GO:0030528 1 6.666667 transcription regulator activity
GO:0045182 0 0.000000 translation regulator activity
This is just a data frame, with rows being slim ids. More fun (and
needing some more thought -- see below) is asking about the GO
categories implied by an expression set, e.g.,
> data(sample.ExpressionSet)
> goSlim(sample.ExpressionSet, oboSlim, "MF")
Count Percent Term
GO:0000166 5 0.6157635 nucleotide binding
GO:0003674 283 34.8522167 molecular_function
GO:0003676 17 2.0935961 nucleic acid binding
GO:0003677 10 1.2315271 DNA binding
GO:0003682 2 0.2463054 chromatin binding
GO:0003700 1 0.1231527 transcription factor activity
GO:0003723 5 0.6157635 RNA binding
GO:0003774 2 0.2463054 motor activity
GO:0003824 82 10.0985222 catalytic activity
GO:0004518 5 0.6157635 nuclease activity
GO:0004871 36 4.4334975 signal transducer activity
GO:0004872 29 3.5714286 receptor activity
GO:0005102 15 1.8472906 receptor binding
GO:0005198 7 0.8620690 structural molecule activity
GO:0005215 44 5.4187192 transporter activity
GO:0005488 112 13.7931034 binding
GO:0005515 49 6.0344828 protein binding
GO:0008135 1 0.1231527 translation factor activity, nuclei...
GO:0008289 4 0.4926108 lipid binding
GO:0016301 13 1.6009852 kinase activity
GO:0016740 23 2.8325123 transferase activity
GO:0016787 40 4.9261084 hydrolase activity
GO:0019825 1 0.1231527 oxygen binding
GO:0030234 14 1.7241379 enzyme regulator activity
GO:0030246 3 0.3694581 carbohydrate binding
GO:0030528 8 0.9852217 transcription regulator activity
GO:0045182 1 0.1231527 translation regulator activity
I'm not sure I've made the right decisions about duplicated GO terms,
and would appreciate any feedback. Here's one example:
> goSlim(GOCollection("GO:0000016"), oboSlim, "MF")
Count Percent Term
GO:0000166 0 0.00000 nucleotide binding
GO:0003674 1 33.33333 molecular_function
GO:0003676 0 0.00000 nucleic acid binding
GO:0003677 0 0.00000 DNA binding
GO:0003682 0 0.00000 chromatin binding
GO:0003700 0 0.00000 transcription factor activity
GO:0003723 0 0.00000 RNA binding
GO:0003774 0 0.00000 motor activity
GO:0003824 1 33.33333 catalytic activity
GO:0004518 0 0.00000 nuclease activity
GO:0004871 0 0.00000 signal transducer activity
GO:0004872 0 0.00000 receptor activity
GO:0005102 0 0.00000 receptor binding
GO:0005198 0 0.00000 structural molecule activity
GO:0005215 0 0.00000 transporter activity
GO:0005488 0 0.00000 binding
GO:0005515 0 0.00000 protein binding
GO:0008135 0 0.00000 translation factor activity, nuclei...
GO:0008289 0 0.00000 lipid binding
GO:0016301 0 0.00000 kinase activity
GO:0016740 0 0.00000 transferase activity
GO:0016787 1 33.33333 hydrolase activity
GO:0019825 0 0.00000 oxygen binding
GO:0030234 0 0.00000 enzyme regulator activity
GO:0030246 0 0.00000 carbohydrate binding
GO:0030528 0 0.00000 transcription regulator activity
GO:0045182 0 0.00000 translation regulator activity
I'm also intending to enable expression set subsetting based on
GOCollections (you can currently do this with GeneSets defined in
GSEABase).
Here are a couple of issues.
The hierarchies in oboSlim overlap, so GO:0000016 gets classified
three different ways. I think this is one of the problems Thomas
alluded to.
A more complicated example is going from ExpressionSet. Each probe
might map to several different GO terms (each of which can be
classified to several different slim terms).
Several features in an ExpressionSet can map to the same GO
term. Currently, I make the GO terms unique, so this
> goSlim(GOCollection(c("GO:0000016", "GO:0000016")), oboSlim, "MF")
produces the same result as above. The result of goSlim is the
classification of all unique terms implied by the feature names,
unweighted by the frequency of the term implied by the feature
names. Again I'm not sure that this is the most helpful; certainly it
makes any kind of statistical assessment difficult.
For those wanting to use this kind of functionality without heading
into the devel branch immediately, here's some code and how to use it:
## Group GO ids into GO_ontology-specific GO_slim categories
GO_slim <- function(ids, GO_slim, GO_ontology="MF", verbose=FALSE) {
require("AnnotationDbi")
require("GO")
## Get GO_slim terms, restricted to GO_ontology
terms <- mget(GO_slim, GOTERM, ifnotfound=NA)
if (any(is.na(terms))) {
if (verbose)
warning("GO_slim ids not found: ",
paste(names(terms)[is.na(terms)], collapse=" "))
terms <- terms[!is.na(terms)]
}
terms <- terms[sapply(terms, Ontology)==GO_ontology]
GO_slim <- names(terms)
## Use GO_ontology to find the required offspring
OFFSPRING <- switch(GO_ontology,
MF=GOMFOFFSPRING,
BP=GOBPOFFSPRING,
CC=GOCCOFFSPRING,
stop("GO_ontology must be 'MF', 'BP', or 'CC'"))
## Get the offspring of GO_slim
slim <- mget(GO_slim, OFFSPRING, ifnotfound=NA)
slim <- slim[!is.na(slim)]
## Reverse the relationship: 'offspring' become keys, 'parents'
## become values. Select the sampled offspring
ids <- unique(ids)
samp <- revmap(slim)[ids]
samp <- samp[!sapply(samp, is.null)]
## Count occurences of each slim
cnt <- table(unlist(samp))
## Adjust for sample ids matching slim ids
idx <- table(ids[which(ids %in% names(slim))])
idx_n<- names(idx)
cnt[idx_n] <- idx + ifelse(is.na(cnt[idx_n]), 0, cnt[idx_n])
## Prepare a data frame for results
df <- data.frame(Slim=names(terms),
Count=0L, Percent=0,
Term=sprintf("%.35s%s",
sapply(terms, Term, USE.NAMES=FALSE),
ifelse(nchar(sapply(terms, Term))>35, "...", "")),
row.names=1)
## add our counts
df[names(cnt),c("Count", "Percent")] <- c(cnt, 100*cnt/sum(cnt))
df[order(row.names(df)),]
}
## Read 'obo' files for GO ids, from the web or disk
GO_oboIds <- function(src) {
## Parse OBO into 'stanza' and 'kv' (key-value) tables.
## VERY NAIVE
data <- readLines(src)
parser <- list(stanza="^\\[(.*)\\]", kv="^([^:]*):\\s*(.*)")
stanza <- data.frame(id=c(0,grep(parser$stanza, data)),
value=c("Root", sub(parser$stanza, "\\1",
grep(parser$stanza, data, value=TRUE))),
stringsAsFactors=FALSE)
kv_pairs <- grep(parser$kv, data, value=TRUE)
kv_id <- grep(parser$kv, data)
stanza_id <- sapply(kv_id, function(x) {
idx <- x > stanza$id
stanza$id[xor(idx, c(idx[-1], FALSE))]
})
kv <- data.frame(id=kv_id, stanza_id=stanza_id,
key=sub(parser$kv, "\\1", kv_pairs),
value=sub(parser$kv, "\\2", kv_pairs),
stringsAsFactors=FALSE)
## Get GO ids
merge(kv[kv$key=="id", names(kv)!="key", drop=FALSE],
stanza[stanza$value=="Term", names(stanza)!="value",
drop=FALSE],
by.x="stanza_id", by.y="id")$value
}
> ## Thomas Girke's original (Sept 2005) slim ids, from
> ## http://www.geneontology.org/GO.slims.shtml
> slimIds <- c("GO:0030246", "GO:0008289", "GO:0003676", "GO:0000166",
> "GO:0019825", "GO:0005515", "GO:0003824", "GO:0030234",
> "GO:0005554", "GO:0003774", "GO:0004871", "GO:0005198",
> "GO:0030528", "GO:0045182", "GO:0005215", "GO:0000004",
> "GO:0006519", "GO:0007154", "GO:0007049", "GO:0016043",
> "GO:0006412", "GO:0006464", "GO:0006810", "GO:0007275",
> "GO:0007049", "GO:0006519", "GO:0005975", "GO:0006629",
> "GO:0006139", "GO:0019748", "GO:0015979", "GO:0005618",
> "GO:0005829", "GO:0005783", "GO:0005768", "GO:0005794",
> "GO:0005739", "GO:0005777", "GO:0009536", "GO:0005840",
> "GO:0005773", "GO:0005764", "GO:0005856", "GO:0005634",
> "GO:0005886", "GO:0008372", "GO:0005576")
> ## Some GO ids
> my_GOIds <- c("GO:0016564", "GO:0003677", "GO:0004345", "GO:0004345",
> "GO:0004345", "GO:0004345", "GO:0004345", "GO:0008265",
> "GO:0003841", "GO:0030151", "GO:0006355", "GO:0009664",
> "GO:0006412", "GO:0006412", "GO:0006412", "GO:0007046",
> "GO:0015979", "GO:0006457", "GO:0008372", "GO:0005618",
> "GO:0005622", "GO:0005840", "GO:0015935", "GO:0000311",
> "GO:0005622", "GO:0009282")
> ## classify GOIds into slimIds for the "MF" ontology
> res <- GO_slim(my_GOIds, slimIds, "MF")
> idx <- res$Count>0
> pie(res[idx, "Count"], row.names(res)[idx])
> data(sample.ExpressionSet)
> require(annotation(sample.ExpressionSet))
> evidenceCode <- "TAS" # restict to particular evidence codes
> res <- mget(featureNames(sample.ExpressionSet), hgu95av2GO)
> res <- res[!is.na(res)]
> gids <- unlist(lapply(res, subListExtract, "GOID"))
> ecode <- unlist(lapply(res, subListExtract, "Evidence"))
> ugids <- unique(gids[ecode %in% evidenceCode])
> df <- GO_slim(ugids, GO_oboIds(src), "MF")
Martin
Thomas Girke <thomas.girke at ucr.edu> writes:
> GO_Slim categories are usually sets of very general custom GO
> terms. Examples of these GO_Slim sets can be downloaded from
> geneontology.org:
>
> http://www.geneontology.org/GO.slims.shtml
>
> When working with GOstats, you can simply use your favorite GO_Slim
> set for subsetting your enrichment analysis results and then plot
> the corresponding counts. Usually, GO_Slim representations,
> especially pie charts, pretend that the different items (e.g. genes)
> are assigned to only one category - which is typically not the case
> - since there are duplicates almost everywhere.
>
> As a suggestion to the developers:
> Considering the popularity of these GO_Slim representations, it
> might be useful to add some instructions in the GOstats PDF that
> illustrate to users how to generate counts and plots for their
> favorite GO_Slim categories? Perhaps with a proper warning about the
> limitations of these analyses.
>
> Best,
>
> Thomas
>
>
> On Wed 10/10/07 11:23, Celine Carret wrote:
>> Hi,
>> you should have a look at this:
>> http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/goSlim.t
>> xt
>>
>> I found it very clear, and useful, and if you don't work on human-chips
>> you can customise the script easily.
>>
>> Best regards
>> Celine
>>
>> -----Original Message-----
>> From: bioconductor-bounces at stat.math.ethz.ch
>> [mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Ganiraju
>> Sent: 09 October 2007 20:37
>> To: bioconductor at stat.math.ethz.ch
>> Subject: [BioC] GO classification
>>
>> hi,
>>
>> I got a set of significant GOids by running my data using GOhypergtest
>> of
>> Gostats. Now im trying to classify these GO ids using the GO_slim
>> classifier. Is there any package in R which can accomplish this job??
>>
>> Thanks
>> Gani
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>> --
>> The Wellcome Trust Sanger Institute is operated by Genome Research
>> Limited, a charity registered in England with number 1021457 and a
>> company registered in England with number 2742969, whose registered
>> office is 215 Euston Road, London, NW1 2BE.
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
> --
> Thomas Girke
> Assistant Professor of Bioinformatics
> Director, IIGB Bioinformatic Facility
> Center for Plant Cell Biology (CEPCEB)
> Institute for Integrative Genome Biology (IIGB)
> Department of Botany and Plant Sciences
> 1008 Noel T. Keen Hall
> University of California
> Riverside, CA 92521
>
> E-mail: thomas.girke at ucr.edu
> Website: http://faculty.ucr.edu/~tgirke
> Ph: 951-827-2469
> Fax: 951-827-4437
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Martin Morgan
Computational Biology Shared Resource Director
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M2 B169
Phone: (208) 667-2793
More information about the Bioconductor
mailing list