[BioC] GO.db: how to get GO Term

Hervé Pagès hpages at fhcrc.org
Wed Jun 24 02:19:04 CEST 2009


Hi Gordon, Martin, Wacek,

Martin Morgan wrote:
> Wacek Kusnierczyk wrote:
>> Marc Carlson wrote:
>>> One thing you can do to make this more efficient is to use mget instead
>>> of as.list().  That way you won't be pulling the whole mapping out of
>>> the database into a list just to get one thing back out.
>>>
>>> mget("GO:0000166",GOTERM,ifnotfound=NA)
>>>
>>> Also, with mget() you can pass in multiple accessions into the 1st
>>> argument and it will just hand you a longer list back.
>>>
>>> mget(c("GO:0000066","GO:0000166"),GOTERM,ifnotfound=NA)
>>>   
>> just being curious, i have checked the performance of all three
>> solutions posted on this list:
>>
>>    library(GO.db)
>>    library(rbenchmark)
>>
>>    ids = sapply(sample(GOTERM, 100), GOID)
>>    print(
>>        benchmark(replications=100, columns=c('test', 'elapsed'),
>>            eapply=eapply(GOTERM[ids], Term),
>>            lapply=lapply(as.list(GOTERM[ids]), Term),
>>            mget=lapply(mget(ids, GOTERM), Term)))
>>
>>    #     test elapsed
>>    # 3 eapply  10.925
>>    # 1 lapply  11.091
>>    # 2   mget  11.160
>>
>> it appears that they are (with the particular data sample used)
>> virtually equivalent in efficiency.
> 
> I'm not the definitive source for this, but I guess the performance is
> dominated by, on the one hand, creating S4 instances for each table
> entry (e.g., in as.list), and on the other immediately extracting a slot
> from the created S4 object.
> 
> With this in mind, I thought one could do
> 
>     tbl <- toTable(GOTERM[ids])
>     res1 <- with(tbl, Term[!duplicated(go_id)])
>     identical(sort(unlist(res0)), sort(res1))
> 
> This is about 10x faster, but now I'm starting to appreciate some of the
> work the software is doing -- there are duplicate go_ids returned by
> toTable, corresponding to synonyms for the terms I've entered.
> 
> Inspired by this success, I looked at the underlying SQL schema (with
> GO_dbschema()) and intercepted at few calls to the db (with
> debug(dbGetQuery)) to arrive at this
> 
>     sql <- sprintf("SELECT DISTINCT term
>                     FROM go_term
>                        LEFT JOIN go_synonym
>                        ON go_term._id=go_synonym._id
>                     WHERE go_term.go_id IN ('%s');",
>                    paste(ids, collapse="','"))
>     res2 <- dbGetQuery(GO_dbconn(), sql)[[1]]

WARNING: There is no guarantee that the data.frame returned by dbGetQuery()
will have its rows sorted in the same order than the ids injected in the IN
clause of the SQL statement. So, here 'res2' is of length 100 but the
mapping between 'ids' and 'res2' is lost. Note that this is not a dbGetQuery()
feature but an SQL feature (the same would apply from the sqlite3 command
line client).

Also note that eapply() has the same kind of problem: one needs to be
careful with the result of eapply(GOTERM[ids], Term) because, strictly
speaking, there is no reason why the returned list should have its
elements in the same order as the keys in 'ids'. This behaviour follows
in fact what the original eapply() does on a *real* environment
where, according to the man page, "the output is not sorted" even though
it seems that it is sorted in the lexicographic order of the symbols
defined in the environment. But since this is not documented, no code
should rely on this. Anyway, IMO eapply() should not be used on Bimap
objects because of this confusing behaviour.

>     identical(sort(res1), sort(res2))
> 
> another 2x gain in speed, but also really paying a significant price in
> terms of responsibility for what the code is doing.

Even faster (DISTINCT and JOIN not needed):

   GOid2Term <- function(ids)
   {
     sql <- sprintf("SELECT go_id, term
                     FROM go_term
                     WHERE go_id IN ('%s')",
                     paste(ids, collapse="','"))
     res <- dbGetQuery(GO_dbconn(), sql)
     ans <- res[[2]]
     names(ans) <- res[[1]]
     unname(ans[ids])
   }

Cheers,
H.


> 
> Martin
> 
>> vQ
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list