[BioC] GO.db: how to get GO Term

Wed Jun 24 10:19:49 CEST 2009

Interesting, thanks.

vQ

> Hi Gordon, Martin, Wacek,
>
> Martin Morgan wrote:
>> Wacek Kusnierczyk wrote:
>>> Marc Carlson wrote:
>>>> One thing you can do to make this more efficient is to use mget
>>>> instead
>>>> of as.list().  That way you won't be pulling the whole mapping out of
>>>> the database into a list just to get one thing back out.
>>>>
>>>> mget("GO:0000166",GOTERM,ifnotfound=NA)
>>>>
>>>> Also, with mget() you can pass in multiple accessions into the 1st
>>>> argument and it will just hand you a longer list back.
>>>>
>>>> mget(c("GO:0000066","GO:0000166"),GOTERM,ifnotfound=NA)
>>>>
>>> just being curious, i have checked the performance of all three
>>> solutions posted on this list:
>>>
>>>    library(GO.db)
>>>    library(rbenchmark)
>>>
>>>    ids = sapply(sample(GOTERM, 100), GOID)
>>>    print(
>>>        benchmark(replications=100, columns=c('test', 'elapsed'),
>>>            eapply=eapply(GOTERM[ids], Term),
>>>            lapply=lapply(as.list(GOTERM[ids]), Term),
>>>            mget=lapply(mget(ids, GOTERM), Term)))
>>>
>>>    #     test elapsed
>>>    # 3 eapply  10.925
>>>    # 1 lapply  11.091
>>>    # 2   mget  11.160
>>>
>>> it appears that they are (with the particular data sample used)
>>> virtually equivalent in efficiency.
>>
>> I'm not the definitive source for this, but I guess the performance is
>> dominated by, on the one hand, creating S4 instances for each table
>> entry (e.g., in as.list), and on the other immediately extracting a slot
>> from the created S4 object.
>>
>> With this in mind, I thought one could do
>>
>>     tbl <- toTable(GOTERM[ids])
>>     res1 <- with(tbl, Term[!duplicated(go_id)])
>>     identical(sort(unlist(res0)), sort(res1))
>>
>> This is about 10x faster, but now I'm starting to appreciate some of the
>> work the software is doing -- there are duplicate go_ids returned by
>> toTable, corresponding to synonyms for the terms I've entered.
>>
>> Inspired by this success, I looked at the underlying SQL schema (with
>> GO_dbschema()) and intercepted at few calls to the db (with
>> debug(dbGetQuery)) to arrive at this
>>
>>     sql <- sprintf("SELECT DISTINCT term
>>                     FROM go_term
>>                        LEFT JOIN go_synonym
>>                        ON go_term._id=go_synonym._id
>>                     WHERE go_term.go_id IN ('%s');",
>>                    paste(ids, collapse="','"))
>>     res2 <- dbGetQuery(GO_dbconn(), sql)[[1]]
>
> WARNING: There is no guarantee that the data.frame returned by
> dbGetQuery()
> will have its rows sorted in the same order than the ids injected in the
> IN
> clause of the SQL statement. So, here 'res2' is of length 100 but the
> mapping between 'ids' and 'res2' is lost. Note that this is not a
> dbGetQuery()
> feature but an SQL feature (the same would apply from the sqlite3 command
> line client).
>
> Also note that eapply() has the same kind of problem: one needs to be
> careful with the result of eapply(GOTERM[ids], Term) because, strictly
> speaking, there is no reason why the returned list should have its
> elements in the same order as the keys in 'ids'. This behaviour follows
> in fact what the original eapply() does on a *real* environment
> where, according to the man page, "the output is not sorted" even though
> it seems that it is sorted in the lexicographic order of the symbols
> defined in the environment. But since this is not documented, no code
> should rely on this. Anyway, IMO eapply() should not be used on Bimap
> objects because of this confusing behaviour.
>
>>     identical(sort(res1), sort(res2))
>>
>> another 2x gain in speed, but also really paying a significant price in
>> terms of responsibility for what the code is doing.
>
> Even faster (DISTINCT and JOIN not needed):
>
>    GOid2Term <- function(ids)
>    {
>      sql <- sprintf("SELECT go_id, term
>                      FROM go_term
>                      WHERE go_id IN ('%s')",
>                      paste(ids, collapse="','"))
>      res <- dbGetQuery(GO_dbconn(), sql)
>      ans <- res[[2]]
>      names(ans) <- res[[1]]
>      unname(ans[ids])
>    }
>
> Cheers,
> H.
>
>
>>
>> Martin
>>
>>> vQ
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M2-B876
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>