[BioC] GO classification
Thomas Girke
thomas.girke at ucr.edu
Fri Oct 12 23:55:52 CEST 2007
Hi Martin,
This preliminary implementation of a GO_Slim function in GSEABase
looks great to me. I particularly like the option to support the
usage of custom GO_Slim sets.
One suggestion regarding the removal of duplicates:
I suggest to make their removal optional on all different levels
(GO-to-Slim and gene-to-GO), and set the default to no duplicate
removal on any level. This, I think, is the most common way these
analyses are performed or how most users would like to perform them.
But I might we wrong here?
Best,
Thomas
On Fri 10/12/07 13:32, Martin Morgan wrote:
> Hi Thomas --
>
> I like the idea of incorporating slims into an existing package (and
> have started to in the development version of GSEABase, version 1.1.5,
> when it appears). I've been playing around a bit and wanted to get
> some feedback from you and the list.
>
> There's an immediately usable version of the underlying code at the
> end of the message (no devel version of GSEABase required). The code
> below is considerably faster than that posted earlier.
>
> In GSEABase, the function getOBOCollection parses 'obo' files, as
> these seem to be how 'slim' sets are sometimes stored. With GSEABase,
> you'd
>
> > fl <- "http://www.geneontology.org/GO_slims/goslim_plant.obo"
> > oboSlim <- getOBOCollection(fl)
> > oboSlim
> collectionType: GO
> ids: GO:0000003, GO:0000166, ..., GO:0045182 (106 total)
> evidenceCode: IMP IPI TAS ISS IDA NAS IEA IGI RCA IEP IC NR ND
>
> 'fl' could be a local file. You could also create an ad hoc collection
> with
>
> > slimTerms <- GOCollection(c("GO:0030246", "GO:0008289", "GO:0003676",
> + "GO:0000166"))
>
> Use the collection of slim ids to classify your own ids,
> e.g.,
>
> > my_GOIds <- c("GO:0016564", "GO:0003677", "GO:0004345", "GO:0004345",
> + "GO:0004345", "GO:0004345", "GO:0004345", "GO:0008265",
> + "GO:0003841", "GO:0030151", "GO:0006355", "GO:0009664",
> + "GO:0006412", "GO:0006412", "GO:0006412", "GO:0007046",
> + "GO:0015979", "GO:0006457", "GO:0008372", "GO:0005618",
> + "GO:0005622", "GO:0005840", "GO:0015935", "GO:0000311",
> + "GO:0005622", "GO:0009282")
> > goSlim(GOCollection(my_GOIds), oboSlim, "MF")
> Count Percent Term
> GO:0000166 0 0.000000 nucleotide binding
> GO:0003674 6 40.000000 molecular_function
> GO:0003676 1 6.666667 nucleic acid binding
> GO:0003677 1 6.666667 DNA binding
> GO:0003682 0 0.000000 chromatin binding
> GO:0003700 0 0.000000 transcription factor activity
> GO:0003723 0 0.000000 RNA binding
> GO:0003774 0 0.000000 motor activity
> GO:0003824 3 20.000000 catalytic activity
> GO:0004518 0 0.000000 nuclease activity
> GO:0004871 0 0.000000 signal transducer activity
> GO:0004872 0 0.000000 receptor activity
> GO:0005102 0 0.000000 receptor binding
> GO:0005198 0 0.000000 structural molecule activity
> GO:0005215 0 0.000000 transporter activity
> GO:0005488 2 13.333333 binding
> GO:0005515 0 0.000000 protein binding
> GO:0008135 0 0.000000 translation factor activity, nuclei...
> GO:0008289 0 0.000000 lipid binding
> GO:0016301 0 0.000000 kinase activity
> GO:0016740 1 6.666667 transferase activity
> GO:0016787 0 0.000000 hydrolase activity
> GO:0019825 0 0.000000 oxygen binding
> GO:0030234 0 0.000000 enzyme regulator activity
> GO:0030246 0 0.000000 carbohydrate binding
> GO:0030528 1 6.666667 transcription regulator activity
> GO:0045182 0 0.000000 translation regulator activity
>
> This is just a data frame, with rows being slim ids. More fun (and
> needing some more thought -- see below) is asking about the GO
> categories implied by an expression set, e.g.,
>
> > data(sample.ExpressionSet)
> > goSlim(sample.ExpressionSet, oboSlim, "MF")
> Count Percent Term
> GO:0000166 5 0.6157635 nucleotide binding
> GO:0003674 283 34.8522167 molecular_function
> GO:0003676 17 2.0935961 nucleic acid binding
> GO:0003677 10 1.2315271 DNA binding
> GO:0003682 2 0.2463054 chromatin binding
> GO:0003700 1 0.1231527 transcription factor activity
> GO:0003723 5 0.6157635 RNA binding
> GO:0003774 2 0.2463054 motor activity
> GO:0003824 82 10.0985222 catalytic activity
> GO:0004518 5 0.6157635 nuclease activity
> GO:0004871 36 4.4334975 signal transducer activity
> GO:0004872 29 3.5714286 receptor activity
> GO:0005102 15 1.8472906 receptor binding
> GO:0005198 7 0.8620690 structural molecule activity
> GO:0005215 44 5.4187192 transporter activity
> GO:0005488 112 13.7931034 binding
> GO:0005515 49 6.0344828 protein binding
> GO:0008135 1 0.1231527 translation factor activity, nuclei...
> GO:0008289 4 0.4926108 lipid binding
> GO:0016301 13 1.6009852 kinase activity
> GO:0016740 23 2.8325123 transferase activity
> GO:0016787 40 4.9261084 hydrolase activity
> GO:0019825 1 0.1231527 oxygen binding
> GO:0030234 14 1.7241379 enzyme regulator activity
> GO:0030246 3 0.3694581 carbohydrate binding
> GO:0030528 8 0.9852217 transcription regulator activity
> GO:0045182 1 0.1231527 translation regulator activity
>
> I'm not sure I've made the right decisions about duplicated GO terms,
> and would appreciate any feedback. Here's one example:
>
> > goSlim(GOCollection("GO:0000016"), oboSlim, "MF")
> Count Percent Term
> GO:0000166 0 0.00000 nucleotide binding
> GO:0003674 1 33.33333 molecular_function
> GO:0003676 0 0.00000 nucleic acid binding
> GO:0003677 0 0.00000 DNA binding
> GO:0003682 0 0.00000 chromatin binding
> GO:0003700 0 0.00000 transcription factor activity
> GO:0003723 0 0.00000 RNA binding
> GO:0003774 0 0.00000 motor activity
> GO:0003824 1 33.33333 catalytic activity
> GO:0004518 0 0.00000 nuclease activity
> GO:0004871 0 0.00000 signal transducer activity
> GO:0004872 0 0.00000 receptor activity
> GO:0005102 0 0.00000 receptor binding
> GO:0005198 0 0.00000 structural molecule activity
> GO:0005215 0 0.00000 transporter activity
> GO:0005488 0 0.00000 binding
> GO:0005515 0 0.00000 protein binding
> GO:0008135 0 0.00000 translation factor activity, nuclei...
> GO:0008289 0 0.00000 lipid binding
> GO:0016301 0 0.00000 kinase activity
> GO:0016740 0 0.00000 transferase activity
> GO:0016787 1 33.33333 hydrolase activity
> GO:0019825 0 0.00000 oxygen binding
> GO:0030234 0 0.00000 enzyme regulator activity
> GO:0030246 0 0.00000 carbohydrate binding
> GO:0030528 0 0.00000 transcription regulator activity
> GO:0045182 0 0.00000 translation regulator activity
>
> I'm also intending to enable expression set subsetting based on
> GOCollections (you can currently do this with GeneSets defined in
> GSEABase).
>
> Here are a couple of issues.
>
> The hierarchies in oboSlim overlap, so GO:0000016 gets classified
> three different ways. I think this is one of the problems Thomas
> alluded to.
>
> A more complicated example is going from ExpressionSet. Each probe
> might map to several different GO terms (each of which can be
> classified to several different slim terms).
>
> Several features in an ExpressionSet can map to the same GO
> term. Currently, I make the GO terms unique, so this
>
> > goSlim(GOCollection(c("GO:0000016", "GO:0000016")), oboSlim, "MF")
>
> produces the same result as above. The result of goSlim is the
> classification of all unique terms implied by the feature names,
> unweighted by the frequency of the term implied by the feature
> names. Again I'm not sure that this is the most helpful; certainly it
> makes any kind of statistical assessment difficult.
>
> For those wanting to use this kind of functionality without heading
> into the devel branch immediately, here's some code and how to use it:
>
> ## Group GO ids into GO_ontology-specific GO_slim categories
> GO_slim <- function(ids, GO_slim, GO_ontology="MF", verbose=FALSE) {
> require("AnnotationDbi")
> require("GO")
>
> ## Get GO_slim terms, restricted to GO_ontology
> terms <- mget(GO_slim, GOTERM, ifnotfound=NA)
> if (any(is.na(terms))) {
> if (verbose)
> warning("GO_slim ids not found: ",
> paste(names(terms)[is.na(terms)], collapse=" "))
> terms <- terms[!is.na(terms)]
> }
> terms <- terms[sapply(terms, Ontology)==GO_ontology]
> GO_slim <- names(terms)
>
> ## Use GO_ontology to find the required offspring
> OFFSPRING <- switch(GO_ontology,
> MF=GOMFOFFSPRING,
> BP=GOBPOFFSPRING,
> CC=GOCCOFFSPRING,
> stop("GO_ontology must be 'MF', 'BP', or 'CC'"))
> ## Get the offspring of GO_slim
> slim <- mget(GO_slim, OFFSPRING, ifnotfound=NA)
> slim <- slim[!is.na(slim)]
> ## Reverse the relationship: 'offspring' become keys, 'parents'
> ## become values. Select the sampled offspring
> ids <- unique(ids)
> samp <- revmap(slim)[ids]
> samp <- samp[!sapply(samp, is.null)]
> ## Count occurences of each slim
> cnt <- table(unlist(samp))
> ## Adjust for sample ids matching slim ids
> idx <- table(ids[which(ids %in% names(slim))])
> idx_n<- names(idx)
> cnt[idx_n] <- idx + ifelse(is.na(cnt[idx_n]), 0, cnt[idx_n])
>
> ## Prepare a data frame for results
> df <- data.frame(Slim=names(terms),
> Count=0L, Percent=0,
> Term=sprintf("%.35s%s",
> sapply(terms, Term, USE.NAMES=FALSE),
> ifelse(nchar(sapply(terms, Term))>35, "...", "")),
> row.names=1)
> ## add our counts
> df[names(cnt),c("Count", "Percent")] <- c(cnt, 100*cnt/sum(cnt))
> df[order(row.names(df)),]
> }
>
> ## Read 'obo' files for GO ids, from the web or disk
> GO_oboIds <- function(src) {
> ## Parse OBO into 'stanza' and 'kv' (key-value) tables.
> ## VERY NAIVE
> data <- readLines(src)
> parser <- list(stanza="^\\[(.*)\\]", kv="^([^:]*):\\s*(.*)")
> stanza <- data.frame(id=c(0,grep(parser$stanza, data)),
> value=c("Root", sub(parser$stanza, "\\1",
> grep(parser$stanza, data, value=TRUE))),
> stringsAsFactors=FALSE)
> kv_pairs <- grep(parser$kv, data, value=TRUE)
> kv_id <- grep(parser$kv, data)
> stanza_id <- sapply(kv_id, function(x) {
> idx <- x > stanza$id
> stanza$id[xor(idx, c(idx[-1], FALSE))]
> })
> kv <- data.frame(id=kv_id, stanza_id=stanza_id,
> key=sub(parser$kv, "\\1", kv_pairs),
> value=sub(parser$kv, "\\2", kv_pairs),
> stringsAsFactors=FALSE)
>
> ## Get GO ids
> merge(kv[kv$key=="id", names(kv)!="key", drop=FALSE],
> stanza[stanza$value=="Term", names(stanza)!="value",
> drop=FALSE],
> by.x="stanza_id", by.y="id")$value
> }
>
> > ## Thomas Girke's original (Sept 2005) slim ids, from
> > ## http://www.geneontology.org/GO.slims.shtml
> > slimIds <- c("GO:0030246", "GO:0008289", "GO:0003676", "GO:0000166",
> > "GO:0019825", "GO:0005515", "GO:0003824", "GO:0030234",
> > "GO:0005554", "GO:0003774", "GO:0004871", "GO:0005198",
> > "GO:0030528", "GO:0045182", "GO:0005215", "GO:0000004",
> > "GO:0006519", "GO:0007154", "GO:0007049", "GO:0016043",
> > "GO:0006412", "GO:0006464", "GO:0006810", "GO:0007275",
> > "GO:0007049", "GO:0006519", "GO:0005975", "GO:0006629",
> > "GO:0006139", "GO:0019748", "GO:0015979", "GO:0005618",
> > "GO:0005829", "GO:0005783", "GO:0005768", "GO:0005794",
> > "GO:0005739", "GO:0005777", "GO:0009536", "GO:0005840",
> > "GO:0005773", "GO:0005764", "GO:0005856", "GO:0005634",
> > "GO:0005886", "GO:0008372", "GO:0005576")
>
> > ## Some GO ids
> > my_GOIds <- c("GO:0016564", "GO:0003677", "GO:0004345", "GO:0004345",
> > "GO:0004345", "GO:0004345", "GO:0004345", "GO:0008265",
> > "GO:0003841", "GO:0030151", "GO:0006355", "GO:0009664",
> > "GO:0006412", "GO:0006412", "GO:0006412", "GO:0007046",
> > "GO:0015979", "GO:0006457", "GO:0008372", "GO:0005618",
> > "GO:0005622", "GO:0005840", "GO:0015935", "GO:0000311",
> > "GO:0005622", "GO:0009282")
>
> > ## classify GOIds into slimIds for the "MF" ontology
> > res <- GO_slim(my_GOIds, slimIds, "MF")
> > idx <- res$Count>0
> > pie(res[idx, "Count"], row.names(res)[idx])
>
> > data(sample.ExpressionSet)
> > require(annotation(sample.ExpressionSet))
> > evidenceCode <- "TAS" # restict to particular evidence codes
> > res <- mget(featureNames(sample.ExpressionSet), hgu95av2GO)
> > res <- res[!is.na(res)]
> > gids <- unlist(lapply(res, subListExtract, "GOID"))
> > ecode <- unlist(lapply(res, subListExtract, "Evidence"))
> > ugids <- unique(gids[ecode %in% evidenceCode])
> > df <- GO_slim(ugids, GO_oboIds(src), "MF")
>
> Martin
>
> Thomas Girke <thomas.girke at ucr.edu> writes:
>
> > GO_Slim categories are usually sets of very general custom GO
> > terms. Examples of these GO_Slim sets can be downloaded from
> > geneontology.org:
> >
> > http://www.geneontology.org/GO.slims.shtml
> >
> > When working with GOstats, you can simply use your favorite GO_Slim
> > set for subsetting your enrichment analysis results and then plot
> > the corresponding counts. Usually, GO_Slim representations,
> > especially pie charts, pretend that the different items (e.g. genes)
> > are assigned to only one category - which is typically not the case
> > - since there are duplicates almost everywhere.
> >
> > As a suggestion to the developers:
> > Considering the popularity of these GO_Slim representations, it
> > might be useful to add some instructions in the GOstats PDF that
> > illustrate to users how to generate counts and plots for their
> > favorite GO_Slim categories? Perhaps with a proper warning about the
> > limitations of these analyses.
> >
> > Best,
> >
> > Thomas
> >
> >
> > On Wed 10/10/07 11:23, Celine Carret wrote:
> >> Hi,
> >> you should have a look at this:
> >> http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/goSlim.t
> >> xt
> >>
> >> I found it very clear, and useful, and if you don't work on human-chips
> >> you can customise the script easily.
> >>
> >> Best regards
> >> Celine
> >>
> >> -----Original Message-----
> >> From: bioconductor-bounces at stat.math.ethz.ch
> >> [mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Ganiraju
> >> Sent: 09 October 2007 20:37
> >> To: bioconductor at stat.math.ethz.ch
> >> Subject: [BioC] GO classification
> >>
> >> hi,
> >>
> >> I got a set of significant GOids by running my data using GOhypergtest
> >> of
> >> Gostats. Now im trying to classify these GO ids using the GO_slim
> >> classifier. Is there any package in R which can accomplish this job??
> >>
> >> Thanks
> >> Gani
> >>
> >> [[alternative HTML version deleted]]
> >>
> >> _______________________________________________
> >> Bioconductor mailing list
> >> Bioconductor at stat.math.ethz.ch
> >> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >> Search the archives:
> >> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>
> >>
> >> --
> >> The Wellcome Trust Sanger Institute is operated by Genome Research
> >> Limited, a charity registered in England with number 1021457 and a
> >> company registered in England with number 2742969, whose registered
> >> office is 215 Euston Road, London, NW1 2BE.
> >>
> >> _______________________________________________
> >> Bioconductor mailing list
> >> Bioconductor at stat.math.ethz.ch
> >> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>
> >
> > --
> > Thomas Girke
> > Assistant Professor of Bioinformatics
> > Director, IIGB Bioinformatic Facility
> > Center for Plant Cell Biology (CEPCEB)
> > Institute for Integrative Genome Biology (IIGB)
> > Department of Botany and Plant Sciences
> > 1008 Noel T. Keen Hall
> > University of California
> > Riverside, CA 92521
> >
> > E-mail: thomas.girke at ucr.edu
> > Website: http://faculty.ucr.edu/~tgirke
> > Ph: 951-827-2469
> > Fax: 951-827-4437
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at stat.math.ethz.ch
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> --
> Martin Morgan
> Computational Biology Shared Resource Director
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M2 B169
> Phone: (208) 667-2793
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Thomas Girke
Assistant Professor of Bioinformatics
Director, IIGB Bioinformatic Facility
Center for Plant Cell Biology (CEPCEB)
Institute for Integrative Genome Biology (IIGB)
Department of Botany and Plant Sciences
1008 Noel T. Keen Hall
University of California
Riverside, CA 92521
E-mail: thomas.girke at ucr.edu
Website: http://faculty.ucr.edu/~tgirke
Ph: 951-827-2469
Fax: 951-827-4437
More information about the Bioconductor
mailing list