[BioC] GO classification

Fri Oct 12 23:55:52 CEST 2007

Hi Martin, 

This preliminary implementation of a GO_Slim function in GSEABase 
looks great to me. I particularly like the option to support the 
usage of custom GO_Slim sets. 

One suggestion regarding the removal of duplicates:
I suggest to make their removal optional on all different levels
(GO-to-Slim and gene-to-GO), and set the default to no duplicate
removal on any level. This, I think, is the most common way these
analyses are performed or how most users would like to perform them.
But I might we wrong here?   

Best,

Thomas

On Fri 10/12/07 13:32, Martin Morgan wrote:
> Hi Thomas --
> 
> I like the idea of incorporating slims into an existing package (and
> have started to in the development version of GSEABase, version 1.1.5,
> when it appears). I've been playing around a bit and wanted to get
> some feedback from you and the list.
> 
> There's an immediately usable version of the underlying code at the
> end of the message (no devel version of GSEABase required). The code
> below is considerably faster than that posted earlier.
> 
> In GSEABase, the function getOBOCollection parses 'obo' files, as
> these seem to be how 'slim' sets are sometimes stored. With GSEABase,
> you'd
> 
> > fl <- "http://www.geneontology.org/GO_slims/goslim_plant.obo"
> > oboSlim <- getOBOCollection(fl)
> > oboSlim
> collectionType: GO 
>   ids: GO:0000003, GO:0000166, ..., GO:0045182 (106 total)
>   evidenceCode: IMP IPI TAS ISS IDA NAS IEA IGI RCA IEP IC NR ND 
> 
> 'fl' could be a local file. You could also create an ad hoc collection
> with
> 
> > slimTerms <- GOCollection(c("GO:0030246", "GO:0008289", "GO:0003676",
> +                             "GO:0000166"))
> 
> Use the collection of slim ids to classify your own ids,
> e.g.,
> 
> > my_GOIds <- c("GO:0016564", "GO:0003677", "GO:0004345", "GO:0004345",
> +               "GO:0004345", "GO:0004345", "GO:0004345", "GO:0008265",
> +               "GO:0003841", "GO:0030151", "GO:0006355", "GO:0009664",
> +               "GO:0006412", "GO:0006412", "GO:0006412", "GO:0007046",
> +               "GO:0015979", "GO:0006457", "GO:0008372", "GO:0005618",
> +               "GO:0005622", "GO:0005840", "GO:0015935", "GO:0000311",
> +               "GO:0005622", "GO:0009282")
> > goSlim(GOCollection(my_GOIds), oboSlim, "MF")
>            Count   Percent                                   Term
> GO:0000166     0  0.000000                     nucleotide binding
> GO:0003674     6 40.000000                     molecular_function
> GO:0003676     1  6.666667                   nucleic acid binding
> GO:0003677     1  6.666667                            DNA binding
> GO:0003682     0  0.000000                      chromatin binding
> GO:0003700     0  0.000000          transcription factor activity
> GO:0003723     0  0.000000                            RNA binding
> GO:0003774     0  0.000000                         motor activity
> GO:0003824     3 20.000000                     catalytic activity
> GO:0004518     0  0.000000                      nuclease activity
> GO:0004871     0  0.000000             signal transducer activity
> GO:0004872     0  0.000000                      receptor activity
> GO:0005102     0  0.000000                       receptor binding
> GO:0005198     0  0.000000           structural molecule activity
> GO:0005215     0  0.000000                   transporter activity
> GO:0005488     2 13.333333                                binding
> GO:0005515     0  0.000000                        protein binding
> GO:0008135     0  0.000000 translation factor activity, nuclei...
> GO:0008289     0  0.000000                          lipid binding
> GO:0016301     0  0.000000                        kinase activity
> GO:0016740     1  6.666667                   transferase activity
> GO:0016787     0  0.000000                     hydrolase activity
> GO:0019825     0  0.000000                         oxygen binding
> GO:0030234     0  0.000000              enzyme regulator activity
> GO:0030246     0  0.000000                   carbohydrate binding
> GO:0030528     1  6.666667       transcription regulator activity
> GO:0045182     0  0.000000         translation regulator activity
> 
> This is just a data frame, with rows being slim ids. More fun (and
> needing some more thought -- see below) is asking about the GO
> categories implied by an expression set, e.g.,
> 
> > data(sample.ExpressionSet)
> > goSlim(sample.ExpressionSet, oboSlim, "MF")
>            Count    Percent                                   Term
> GO:0000166     5  0.6157635                     nucleotide binding
> GO:0003674   283 34.8522167                     molecular_function
> GO:0003676    17  2.0935961                   nucleic acid binding
> GO:0003677    10  1.2315271                            DNA binding
> GO:0003682     2  0.2463054                      chromatin binding
> GO:0003700     1  0.1231527          transcription factor activity
> GO:0003723     5  0.6157635                            RNA binding
> GO:0003774     2  0.2463054                         motor activity
> GO:0003824    82 10.0985222                     catalytic activity
> GO:0004518     5  0.6157635                      nuclease activity
> GO:0004871    36  4.4334975             signal transducer activity
> GO:0004872    29  3.5714286                      receptor activity
> GO:0005102    15  1.8472906                       receptor binding
> GO:0005198     7  0.8620690           structural molecule activity
> GO:0005215    44  5.4187192                   transporter activity
> GO:0005488   112 13.7931034                                binding
> GO:0005515    49  6.0344828                        protein binding
> GO:0008135     1  0.1231527 translation factor activity, nuclei...
> GO:0008289     4  0.4926108                          lipid binding
> GO:0016301    13  1.6009852                        kinase activity
> GO:0016740    23  2.8325123                   transferase activity
> GO:0016787    40  4.9261084                     hydrolase activity
> GO:0019825     1  0.1231527                         oxygen binding
> GO:0030234    14  1.7241379              enzyme regulator activity
> GO:0030246     3  0.3694581                   carbohydrate binding
> GO:0030528     8  0.9852217       transcription regulator activity
> GO:0045182     1  0.1231527         translation regulator activity
> 
> I'm not sure I've made the right decisions about duplicated GO terms,
> and would appreciate any feedback. Here's one example:
> 
> > goSlim(GOCollection("GO:0000016"), oboSlim, "MF")
>            Count  Percent                                   Term
> GO:0000166     0  0.00000                     nucleotide binding
> GO:0003674     1 33.33333                     molecular_function
> GO:0003676     0  0.00000                   nucleic acid binding
> GO:0003677     0  0.00000                            DNA binding
> GO:0003682     0  0.00000                      chromatin binding
> GO:0003700     0  0.00000          transcription factor activity
> GO:0003723     0  0.00000                            RNA binding
> GO:0003774     0  0.00000                         motor activity
> GO:0003824     1 33.33333                     catalytic activity
> GO:0004518     0  0.00000                      nuclease activity
> GO:0004871     0  0.00000             signal transducer activity
> GO:0004872     0  0.00000                      receptor activity
> GO:0005102     0  0.00000                       receptor binding
> GO:0005198     0  0.00000           structural molecule activity
> GO:0005215     0  0.00000                   transporter activity
> GO:0005488     0  0.00000                                binding
> GO:0005515     0  0.00000                        protein binding
> GO:0008135     0  0.00000 translation factor activity, nuclei...
> GO:0008289     0  0.00000                          lipid binding
> GO:0016301     0  0.00000                        kinase activity
> GO:0016740     0  0.00000                   transferase activity
> GO:0016787     1 33.33333                     hydrolase activity
> GO:0019825     0  0.00000                         oxygen binding
> GO:0030234     0  0.00000              enzyme regulator activity
> GO:0030246     0  0.00000                   carbohydrate binding
> GO:0030528     0  0.00000       transcription regulator activity
> GO:0045182     0  0.00000         translation regulator activity
> 
> I'm also intending to enable expression set subsetting based on
> GOCollections (you can currently do this with GeneSets defined in
> GSEABase). 
> 
> Here are a couple of issues.
> 
> The hierarchies in oboSlim overlap, so GO:0000016 gets classified
> three different ways. I think this is one of the problems Thomas
> alluded to.
> 
> A more complicated example is going from ExpressionSet. Each probe
> might map to several different GO terms (each of which can be
> classified to several different slim terms).
> 
> Several features in an ExpressionSet can map to the same GO
> term. Currently, I make the GO terms unique, so this
> 
> > goSlim(GOCollection(c("GO:0000016", "GO:0000016")), oboSlim, "MF")
> 
> produces the same result as above. The result of goSlim is the
> classification of all unique terms implied by the feature names,
> unweighted by the frequency of the term implied by the feature
> names. Again I'm not sure that this is the most helpful; certainly it
> makes any kind of statistical assessment difficult.
> 
> For those wanting to use this kind of functionality without heading
> into the devel branch immediately, here's some code and how to use it:
> 
> ## Group GO ids into GO_ontology-specific GO_slim categories
> GO_slim <- function(ids, GO_slim, GO_ontology="MF", verbose=FALSE) {
>     require("AnnotationDbi")
>     require("GO")
> 
>     ## Get GO_slim terms, restricted to GO_ontology
>     terms <- mget(GO_slim, GOTERM, ifnotfound=NA)
>     if (any(is.na(terms))) {
>         if (verbose)
>             warning("GO_slim ids not found: ",
>                     paste(names(terms)[is.na(terms)], collapse=" "))
>         terms <- terms[!is.na(terms)]
>     }
>     terms <- terms[sapply(terms, Ontology)==GO_ontology]
>     GO_slim <- names(terms)
> 
>     ## Use GO_ontology to find the required offspring
>     OFFSPRING <- switch(GO_ontology,
>                         MF=GOMFOFFSPRING,
>                         BP=GOBPOFFSPRING,
>                         CC=GOCCOFFSPRING,
>                         stop("GO_ontology must be 'MF', 'BP', or 'CC'"))
>     ## Get the offspring of GO_slim
>     slim <- mget(GO_slim, OFFSPRING, ifnotfound=NA)
>     slim <- slim[!is.na(slim)]
>     ## Reverse the relationship: 'offspring' become keys, 'parents'
>     ## become values. Select the sampled offspring
>     ids <- unique(ids)
>     samp <- revmap(slim)[ids]
>     samp <- samp[!sapply(samp, is.null)]
>     ## Count occurences of each slim
>     cnt <- table(unlist(samp))
>     ## Adjust for sample ids matching slim ids
>     idx <- table(ids[which(ids %in% names(slim))])
>     idx_n<- names(idx)
>     cnt[idx_n] <- idx + ifelse(is.na(cnt[idx_n]), 0, cnt[idx_n])
> 
>     ## Prepare a data frame for results
>     df <- data.frame(Slim=names(terms),
>                      Count=0L, Percent=0,
>                      Term=sprintf("%.35s%s",
>                        sapply(terms, Term, USE.NAMES=FALSE),
>                        ifelse(nchar(sapply(terms, Term))>35, "...", "")),
>                      row.names=1)
>     ## add our counts
>     df[names(cnt),c("Count", "Percent")] <- c(cnt, 100*cnt/sum(cnt))
>     df[order(row.names(df)),]
> }
> 
> ## Read 'obo' files for GO ids, from the web or disk
> GO_oboIds <- function(src) {
>     ## Parse OBO into 'stanza' and 'kv' (key-value) tables.
>     ## VERY NAIVE
>     data <- readLines(src)
>     parser <- list(stanza="^\\[(.*)\\]", kv="^([^:]*):\\s*(.*)")
>     stanza <- data.frame(id=c(0,grep(parser$stanza, data)),
>                          value=c("Root", sub(parser$stanza, "\\1",
>                            grep(parser$stanza, data, value=TRUE))),
>                          stringsAsFactors=FALSE)
>     kv_pairs <- grep(parser$kv, data, value=TRUE)
>     kv_id <- grep(parser$kv, data)
>     stanza_id <- sapply(kv_id, function(x) {
>         idx <- x > stanza$id
>         stanza$id[xor(idx, c(idx[-1], FALSE))]
>     })
>     kv <- data.frame(id=kv_id, stanza_id=stanza_id,
>                      key=sub(parser$kv, "\\1", kv_pairs),
>                      value=sub(parser$kv, "\\2", kv_pairs),
>                      stringsAsFactors=FALSE)
> 
>     ## Get GO ids
>     merge(kv[kv$key=="id", names(kv)!="key", drop=FALSE],
>           stanza[stanza$value=="Term", names(stanza)!="value",
>                  drop=FALSE],
>           by.x="stanza_id", by.y="id")$value
> }
> 
> > ## Thomas Girke's original (Sept 2005) slim ids, from
> > ## http://www.geneontology.org/GO.slims.shtml
> > slimIds <- c("GO:0030246", "GO:0008289", "GO:0003676", "GO:0000166",
> >              "GO:0019825", "GO:0005515", "GO:0003824", "GO:0030234",
> >              "GO:0005554", "GO:0003774", "GO:0004871", "GO:0005198",
> >              "GO:0030528", "GO:0045182", "GO:0005215", "GO:0000004",
> >              "GO:0006519", "GO:0007154", "GO:0007049", "GO:0016043",
> >              "GO:0006412", "GO:0006464", "GO:0006810", "GO:0007275",
> >              "GO:0007049", "GO:0006519", "GO:0005975", "GO:0006629",
> >              "GO:0006139", "GO:0019748", "GO:0015979", "GO:0005618",
> >              "GO:0005829", "GO:0005783", "GO:0005768", "GO:0005794",
> >              "GO:0005739", "GO:0005777", "GO:0009536", "GO:0005840",
> >              "GO:0005773", "GO:0005764", "GO:0005856", "GO:0005634",
> >              "GO:0005886", "GO:0008372", "GO:0005576")
> 
> > ## Some GO ids
> > my_GOIds <- c("GO:0016564", "GO:0003677", "GO:0004345", "GO:0004345",
> >               "GO:0004345", "GO:0004345", "GO:0004345", "GO:0008265",
> >               "GO:0003841", "GO:0030151", "GO:0006355", "GO:0009664",
> >               "GO:0006412", "GO:0006412", "GO:0006412", "GO:0007046",
> >               "GO:0015979", "GO:0006457", "GO:0008372", "GO:0005618",
> >               "GO:0005622", "GO:0005840", "GO:0015935", "GO:0000311",
> >               "GO:0005622", "GO:0009282")
> 
> > ## classify GOIds into slimIds for the "MF" ontology
> > res <- GO_slim(my_GOIds, slimIds, "MF")
> > idx <- res$Count>0
> > pie(res[idx, "Count"], row.names(res)[idx])
> 
> > data(sample.ExpressionSet)
> > require(annotation(sample.ExpressionSet))
> > evidenceCode <- "TAS" # restict to particular evidence codes
> > res <- mget(featureNames(sample.ExpressionSet), hgu95av2GO)
> > res <- res[!is.na(res)]
> > gids <- unlist(lapply(res, subListExtract, "GOID"))
> > ecode <- unlist(lapply(res, subListExtract, "Evidence"))
> > ugids <- unique(gids[ecode %in% evidenceCode])
> > df <- GO_slim(ugids, GO_oboIds(src), "MF")
> 
> Martin
> 
> Thomas Girke <thomas.girke at ucr.edu> writes:
> 
> > GO_Slim categories are usually sets of very general custom GO
> > terms. Examples of these GO_Slim sets can be downloaded from
> > geneontology.org:
> >
> > 	http://www.geneontology.org/GO.slims.shtml
> >
> > When working with GOstats, you can simply use your favorite GO_Slim
> > set for subsetting your enrichment analysis results and then plot
> > the corresponding counts. Usually, GO_Slim representations,
> > especially pie charts, pretend that the different items (e.g. genes)
> > are assigned to only one category - which is typically not the case
> > - since there are duplicates almost everywhere.
> >
> > As a suggestion to the developers:
> > Considering the popularity of these GO_Slim representations, it
> > might be useful to add some instructions in the GOstats PDF that
> > illustrate to users how to generate counts and plots for their
> > favorite GO_Slim categories? Perhaps with a proper warning about the
> > limitations of these analyses.
> >
> > Best,
> >
> > Thomas
> >
> >
> > On Wed 10/10/07 11:23, Celine Carret wrote:
> >> Hi, 
> >> you should have a look at this:
> >> http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/goSlim.t
> >> xt
> >> 
> >> I found it very clear, and useful, and if you don't work on human-chips
> >> you can customise the script easily.
> >> 
> >> Best regards
> >> Celine
> >> 
> >> -----Original Message-----
> >> From: bioconductor-bounces at stat.math.ethz.ch
> >> [mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Ganiraju
> >> Sent: 09 October 2007 20:37
> >> To: bioconductor at stat.math.ethz.ch
> >> Subject: [BioC] GO classification
> >> 
> >> hi,
> >> 
> >> I got a set of  significant GOids by running my data using GOhypergtest
> >> of
> >> Gostats. Now im trying to classify these GO ids using the GO_slim
> >> classifier. Is there any package in R which can accomplish this job??
> >> 
> >> Thanks
> >> Gani
> >> 
> >> 	[[alternative HTML version deleted]]
> >> 
> >> _______________________________________________
> >> Bioconductor mailing list
> >> Bioconductor at stat.math.ethz.ch
> >> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >> Search the archives:
> >> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >> 
> >> 
> >> -- 
> >>  The Wellcome Trust Sanger Institute is operated by Genome Research 
> >>  Limited, a charity registered in England with number 1021457 and a 
> >>  company registered in England with number 2742969, whose registered 
> >>  office is 215 Euston Road, London, NW1 2BE.
> >> 
> >> _______________________________________________
> >> Bioconductor mailing list
> >> Bioconductor at stat.math.ethz.ch
> >> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> >> 
> >
> > -- 
> > Thomas Girke
> > Assistant Professor of Bioinformatics
> > Director, IIGB Bioinformatic Facility
> > Center for Plant Cell Biology (CEPCEB)
> > Institute for Integrative Genome Biology (IIGB)
> > Department of Botany and Plant Sciences
> > 1008 Noel T. Keen Hall
> > University of California
> > Riverside, CA 92521
> >
> > E-mail: thomas.girke at ucr.edu
> > Website: http://faculty.ucr.edu/~tgirke
> > Ph: 951-827-2469
> > Fax: 951-827-4437
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at stat.math.ethz.ch
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> -- 
> Martin Morgan
> Computational Biology Shared Resource Director
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
> 
> Location: Arnold Building M2 B169
> Phone: (208) 667-2793
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 

-- 
Thomas Girke
Assistant Professor of Bioinformatics
Director, IIGB Bioinformatic Facility
Center for Plant Cell Biology (CEPCEB)
Institute for Integrative Genome Biology (IIGB)
Department of Botany and Plant Sciences
1008 Noel T. Keen Hall
University of California
Riverside, CA 92521

E-mail: thomas.girke at ucr.edu
Website: http://faculty.ucr.edu/~tgirke
Ph: 951-827-2469
Fax: 951-827-4437