[BioC] GO classification

Fri Oct 12 22:32:30 CEST 2007

Hi Thomas --

I like the idea of incorporating slims into an existing package (and
have started to in the development version of GSEABase, version 1.1.5,
when it appears). I've been playing around a bit and wanted to get
some feedback from you and the list.

There's an immediately usable version of the underlying code at the
end of the message (no devel version of GSEABase required). The code
below is considerably faster than that posted earlier.

In GSEABase, the function getOBOCollection parses 'obo' files, as
these seem to be how 'slim' sets are sometimes stored. With GSEABase,
you'd

> fl <- "http://www.geneontology.org/GO_slims/goslim_plant.obo"
> oboSlim <- getOBOCollection(fl)
> oboSlim
collectionType: GO 
  ids: GO:0000003, GO:0000166, ..., GO:0045182 (106 total)
  evidenceCode: IMP IPI TAS ISS IDA NAS IEA IGI RCA IEP IC NR ND 

'fl' could be a local file. You could also create an ad hoc collection
with

> slimTerms <- GOCollection(c("GO:0030246", "GO:0008289", "GO:0003676",
+                             "GO:0000166"))

Use the collection of slim ids to classify your own ids,
e.g.,

> my_GOIds <- c("GO:0016564", "GO:0003677", "GO:0004345", "GO:0004345",
+               "GO:0004345", "GO:0004345", "GO:0004345", "GO:0008265",
+               "GO:0003841", "GO:0030151", "GO:0006355", "GO:0009664",
+               "GO:0006412", "GO:0006412", "GO:0006412", "GO:0007046",
+               "GO:0015979", "GO:0006457", "GO:0008372", "GO:0005618",
+               "GO:0005622", "GO:0005840", "GO:0015935", "GO:0000311",
+               "GO:0005622", "GO:0009282")
> goSlim(GOCollection(my_GOIds), oboSlim, "MF")
           Count   Percent                                   Term
GO:0000166     0  0.000000                     nucleotide binding
GO:0003674     6 40.000000                     molecular_function
GO:0003676     1  6.666667                   nucleic acid binding
GO:0003677     1  6.666667                            DNA binding
GO:0003682     0  0.000000                      chromatin binding
GO:0003700     0  0.000000          transcription factor activity
GO:0003723     0  0.000000                            RNA binding
GO:0003774     0  0.000000                         motor activity
GO:0003824     3 20.000000                     catalytic activity
GO:0004518     0  0.000000                      nuclease activity
GO:0004871     0  0.000000             signal transducer activity
GO:0004872     0  0.000000                      receptor activity
GO:0005102     0  0.000000                       receptor binding
GO:0005198     0  0.000000           structural molecule activity
GO:0005215     0  0.000000                   transporter activity
GO:0005488     2 13.333333                                binding
GO:0005515     0  0.000000                        protein binding
GO:0008135     0  0.000000 translation factor activity, nuclei...
GO:0008289     0  0.000000                          lipid binding
GO:0016301     0  0.000000                        kinase activity
GO:0016740     1  6.666667                   transferase activity
GO:0016787     0  0.000000                     hydrolase activity
GO:0019825     0  0.000000                         oxygen binding
GO:0030234     0  0.000000              enzyme regulator activity
GO:0030246     0  0.000000                   carbohydrate binding
GO:0030528     1  6.666667       transcription regulator activity
GO:0045182     0  0.000000         translation regulator activity

This is just a data frame, with rows being slim ids. More fun (and
needing some more thought -- see below) is asking about the GO
categories implied by an expression set, e.g.,

> data(sample.ExpressionSet)
> goSlim(sample.ExpressionSet, oboSlim, "MF")
           Count    Percent                                   Term
GO:0000166     5  0.6157635                     nucleotide binding
GO:0003674   283 34.8522167                     molecular_function
GO:0003676    17  2.0935961                   nucleic acid binding
GO:0003677    10  1.2315271                            DNA binding
GO:0003682     2  0.2463054                      chromatin binding
GO:0003700     1  0.1231527          transcription factor activity
GO:0003723     5  0.6157635                            RNA binding
GO:0003774     2  0.2463054                         motor activity
GO:0003824    82 10.0985222                     catalytic activity
GO:0004518     5  0.6157635                      nuclease activity
GO:0004871    36  4.4334975             signal transducer activity
GO:0004872    29  3.5714286                      receptor activity
GO:0005102    15  1.8472906                       receptor binding
GO:0005198     7  0.8620690           structural molecule activity
GO:0005215    44  5.4187192                   transporter activity
GO:0005488   112 13.7931034                                binding
GO:0005515    49  6.0344828                        protein binding
GO:0008135     1  0.1231527 translation factor activity, nuclei...
GO:0008289     4  0.4926108                          lipid binding
GO:0016301    13  1.6009852                        kinase activity
GO:0016740    23  2.8325123                   transferase activity
GO:0016787    40  4.9261084                     hydrolase activity
GO:0019825     1  0.1231527                         oxygen binding
GO:0030234    14  1.7241379              enzyme regulator activity
GO:0030246     3  0.3694581                   carbohydrate binding
GO:0030528     8  0.9852217       transcription regulator activity
GO:0045182     1  0.1231527         translation regulator activity

I'm not sure I've made the right decisions about duplicated GO terms,
and would appreciate any feedback. Here's one example:

> goSlim(GOCollection("GO:0000016"), oboSlim, "MF")
           Count  Percent                                   Term
GO:0000166     0  0.00000                     nucleotide binding
GO:0003674     1 33.33333                     molecular_function
GO:0003676     0  0.00000                   nucleic acid binding
GO:0003677     0  0.00000                            DNA binding
GO:0003682     0  0.00000                      chromatin binding
GO:0003700     0  0.00000          transcription factor activity
GO:0003723     0  0.00000                            RNA binding
GO:0003774     0  0.00000                         motor activity
GO:0003824     1 33.33333                     catalytic activity
GO:0004518     0  0.00000                      nuclease activity
GO:0004871     0  0.00000             signal transducer activity
GO:0004872     0  0.00000                      receptor activity
GO:0005102     0  0.00000                       receptor binding
GO:0005198     0  0.00000           structural molecule activity
GO:0005215     0  0.00000                   transporter activity
GO:0005488     0  0.00000                                binding
GO:0005515     0  0.00000                        protein binding
GO:0008135     0  0.00000 translation factor activity, nuclei...
GO:0008289     0  0.00000                          lipid binding
GO:0016301     0  0.00000                        kinase activity
GO:0016740     0  0.00000                   transferase activity
GO:0016787     1 33.33333                     hydrolase activity
GO:0019825     0  0.00000                         oxygen binding
GO:0030234     0  0.00000              enzyme regulator activity
GO:0030246     0  0.00000                   carbohydrate binding
GO:0030528     0  0.00000       transcription regulator activity
GO:0045182     0  0.00000         translation regulator activity

I'm also intending to enable expression set subsetting based on
GOCollections (you can currently do this with GeneSets defined in
GSEABase). 

Here are a couple of issues.

The hierarchies in oboSlim overlap, so GO:0000016 gets classified
three different ways. I think this is one of the problems Thomas
alluded to.

A more complicated example is going from ExpressionSet. Each probe
might map to several different GO terms (each of which can be
classified to several different slim terms).

Several features in an ExpressionSet can map to the same GO
term. Currently, I make the GO terms unique, so this

> goSlim(GOCollection(c("GO:0000016", "GO:0000016")), oboSlim, "MF")

produces the same result as above. The result of goSlim is the
classification of all unique terms implied by the feature names,
unweighted by the frequency of the term implied by the feature
names. Again I'm not sure that this is the most helpful; certainly it
makes any kind of statistical assessment difficult.

For those wanting to use this kind of functionality without heading
into the devel branch immediately, here's some code and how to use it:

## Group GO ids into GO_ontology-specific GO_slim categories
GO_slim <- function(ids, GO_slim, GO_ontology="MF", verbose=FALSE) {
    require("AnnotationDbi")
    require("GO")

    ## Get GO_slim terms, restricted to GO_ontology
    terms <- mget(GO_slim, GOTERM, ifnotfound=NA)
    if (any(is.na(terms))) {
        if (verbose)
            warning("GO_slim ids not found: ",
                    paste(names(terms)[is.na(terms)], collapse=" "))
        terms <- terms[!is.na(terms)]
    }
    terms <- terms[sapply(terms, Ontology)==GO_ontology]
    GO_slim <- names(terms)

    ## Use GO_ontology to find the required offspring
    OFFSPRING <- switch(GO_ontology,
                        MF=GOMFOFFSPRING,
                        BP=GOBPOFFSPRING,
                        CC=GOCCOFFSPRING,
                        stop("GO_ontology must be 'MF', 'BP', or 'CC'"))
    ## Get the offspring of GO_slim
    slim <- mget(GO_slim, OFFSPRING, ifnotfound=NA)
    slim <- slim[!is.na(slim)]
    ## Reverse the relationship: 'offspring' become keys, 'parents'
    ## become values. Select the sampled offspring
    ids <- unique(ids)
    samp <- revmap(slim)[ids]
    samp <- samp[!sapply(samp, is.null)]
    ## Count occurences of each slim
    cnt <- table(unlist(samp))
    ## Adjust for sample ids matching slim ids
    idx <- table(ids[which(ids %in% names(slim))])
    idx_n<- names(idx)
    cnt[idx_n] <- idx + ifelse(is.na(cnt[idx_n]), 0, cnt[idx_n])

    ## Prepare a data frame for results
    df <- data.frame(Slim=names(terms),
                     Count=0L, Percent=0,
                     Term=sprintf("%.35s%s",
                       sapply(terms, Term, USE.NAMES=FALSE),
                       ifelse(nchar(sapply(terms, Term))>35, "...", "")),
                     row.names=1)
    ## add our counts
    df[names(cnt),c("Count", "Percent")] <- c(cnt, 100*cnt/sum(cnt))
    df[order(row.names(df)),]
}

## Read 'obo' files for GO ids, from the web or disk
GO_oboIds <- function(src) {
    ## Parse OBO into 'stanza' and 'kv' (key-value) tables.
    ## VERY NAIVE
    data <- readLines(src)
    parser <- list(stanza="^\\[(.*)\\]", kv="^([^:]*):\\s*(.*)")
    stanza <- data.frame(id=c(0,grep(parser$stanza, data)),
                         value=c("Root", sub(parser$stanza, "\\1",
                           grep(parser$stanza, data, value=TRUE))),
                         stringsAsFactors=FALSE)
    kv_pairs <- grep(parser$kv, data, value=TRUE)
    kv_id <- grep(parser$kv, data)
    stanza_id <- sapply(kv_id, function(x) {
        idx <- x > stanza$id
        stanza$id[xor(idx, c(idx[-1], FALSE))]
    })
    kv <- data.frame(id=kv_id, stanza_id=stanza_id,
                     key=sub(parser$kv, "\\1", kv_pairs),
                     value=sub(parser$kv, "\\2", kv_pairs),
                     stringsAsFactors=FALSE)

    ## Get GO ids
    merge(kv[kv$key=="id", names(kv)!="key", drop=FALSE],
          stanza[stanza$value=="Term", names(stanza)!="value",
                 drop=FALSE],
          by.x="stanza_id", by.y="id")$value
}

> ## Thomas Girke's original (Sept 2005) slim ids, from
> ## http://www.geneontology.org/GO.slims.shtml
> slimIds <- c("GO:0030246", "GO:0008289", "GO:0003676", "GO:0000166",
>              "GO:0019825", "GO:0005515", "GO:0003824", "GO:0030234",
>              "GO:0005554", "GO:0003774", "GO:0004871", "GO:0005198",
>              "GO:0030528", "GO:0045182", "GO:0005215", "GO:0000004",
>              "GO:0006519", "GO:0007154", "GO:0007049", "GO:0016043",
>              "GO:0006412", "GO:0006464", "GO:0006810", "GO:0007275",
>              "GO:0007049", "GO:0006519", "GO:0005975", "GO:0006629",
>              "GO:0006139", "GO:0019748", "GO:0015979", "GO:0005618",
>              "GO:0005829", "GO:0005783", "GO:0005768", "GO:0005794",
>              "GO:0005739", "GO:0005777", "GO:0009536", "GO:0005840",
>              "GO:0005773", "GO:0005764", "GO:0005856", "GO:0005634",
>              "GO:0005886", "GO:0008372", "GO:0005576")

> ## Some GO ids
> my_GOIds <- c("GO:0016564", "GO:0003677", "GO:0004345", "GO:0004345",
>               "GO:0004345", "GO:0004345", "GO:0004345", "GO:0008265",
>               "GO:0003841", "GO:0030151", "GO:0006355", "GO:0009664",
>               "GO:0006412", "GO:0006412", "GO:0006412", "GO:0007046",
>               "GO:0015979", "GO:0006457", "GO:0008372", "GO:0005618",
>               "GO:0005622", "GO:0005840", "GO:0015935", "GO:0000311",
>               "GO:0005622", "GO:0009282")

> ## classify GOIds into slimIds for the "MF" ontology
> res <- GO_slim(my_GOIds, slimIds, "MF")
> idx <- res$Count>0
> pie(res[idx, "Count"], row.names(res)[idx])

> data(sample.ExpressionSet)
> require(annotation(sample.ExpressionSet))
> evidenceCode <- "TAS" # restict to particular evidence codes
> res <- mget(featureNames(sample.ExpressionSet), hgu95av2GO)
> res <- res[!is.na(res)]
> gids <- unlist(lapply(res, subListExtract, "GOID"))
> ecode <- unlist(lapply(res, subListExtract, "Evidence"))
> ugids <- unique(gids[ecode %in% evidenceCode])
> df <- GO_slim(ugids, GO_oboIds(src), "MF")

Martin

Thomas Girke <thomas.girke at ucr.edu> writes:

> GO_Slim categories are usually sets of very general custom GO
> terms. Examples of these GO_Slim sets can be downloaded from
> geneontology.org:
>
> 	http://www.geneontology.org/GO.slims.shtml
>
> When working with GOstats, you can simply use your favorite GO_Slim
> set for subsetting your enrichment analysis results and then plot
> the corresponding counts. Usually, GO_Slim representations,
> especially pie charts, pretend that the different items (e.g. genes)
> are assigned to only one category - which is typically not the case
> - since there are duplicates almost everywhere.
>
> As a suggestion to the developers:
> Considering the popularity of these GO_Slim representations, it
> might be useful to add some instructions in the GOstats PDF that
> illustrate to users how to generate counts and plots for their
> favorite GO_Slim categories? Perhaps with a proper warning about the
> limitations of these analyses.
>
> Best,
>
> Thomas
>
>
> On Wed 10/10/07 11:23, Celine Carret wrote:
>> Hi, 
>> you should have a look at this:
>> http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/goSlim.t
>> xt
>> 
>> I found it very clear, and useful, and if you don't work on human-chips
>> you can customise the script easily.
>> 
>> Best regards
>> Celine
>> 
>> -----Original Message-----
>> From: bioconductor-bounces at stat.math.ethz.ch
>> [mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Ganiraju
>> Sent: 09 October 2007 20:37
>> To: bioconductor at stat.math.ethz.ch
>> Subject: [BioC] GO classification
>> 
>> hi,
>> 
>> I got a set of  significant GOids by running my data using GOhypergtest
>> of
>> Gostats. Now im trying to classify these GO ids using the GO_slim
>> classifier. Is there any package in R which can accomplish this job??
>> 
>> Thanks
>> Gani
>> 
>> 	[[alternative HTML version deleted]]
>> 
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> 
>> 
>> -- 
>>  The Wellcome Trust Sanger Institute is operated by Genome Research 
>>  Limited, a charity registered in England with number 1021457 and a 
>>  company registered in England with number 2742969, whose registered 
>>  office is 215 Euston Road, London, NW1 2BE.
>> 
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>> 
>
> -- 
> Thomas Girke
> Assistant Professor of Bioinformatics
> Director, IIGB Bioinformatic Facility
> Center for Plant Cell Biology (CEPCEB)
> Institute for Integrative Genome Biology (IIGB)
> Department of Botany and Plant Sciences
> 1008 Noel T. Keen Hall
> University of California
> Riverside, CA 92521
>
> E-mail: thomas.girke at ucr.edu
> Website: http://faculty.ucr.edu/~tgirke
> Ph: 951-827-2469
> Fax: 951-827-4437
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Martin Morgan
Computational Biology Shared Resource Director
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (208) 667-2793