[BioC] question about ontoCompare() performance change
Scott Markel
SMarkel at accelrys.com
Fri Nov 13 04:39:24 CET 2009
Seth,
Thank you for your analysis and the initial pass at a replacement
implementation. Much appreciated.
Scott
Scott Markel, Ph.D.
Principal Bioinformatics Architect email: smarkel at accelrys.com
Accelrys (SciTegic R&D) mobile: +1 858 205 3653
10188 Telesis Court, Suite 100 voice: +1 858 799 5603
San Diego, CA 92121 fax: +1 858 799 5222
USA web: http://www.accelrys.com
http://www.linkedin.com/in/smarkel
Vice President, Board of Directors:
International Society for Computational Biology
Chair: ISCB Publications Committee
Associate Editor: PLoS Computational Biology
Editorial Board: Briefings in Bioinformatics
-----Original Message-----
From: Seth Falcon [mailto:sfalcon at fhcrc.org]
Sent: Thursday, 12 November 2009 1:44 PM
To: Scott Markel
Cc: bioconductor at stat.math.ethz.ch; Agnes Paquet
Subject: Re: [BioC] question about ontoCompare() performance change
Hi again,
On 10/29/09 10:26 AM, Seth Falcon wrote:
> Thanks for the reminder and providing a reproducible example. We will
> take a look and see if we can understand and provide a fix for the
> slow down.
The goTools::ontoCompare function as currently coded takes "the long way" at a couple of points when dealing with the GO annotation in the GO.db package. Unfortunately, I don't see an easy way to make just a few small changes to the existing function. I believe a significant refactoring is required.
To that end, I've attempted to understand the main goal of the ontoCompare function and to reproduce some of the functionality with a different coding approach. My intention is to get things started, not to furnish a complete fix. I have attached an R file containing functions for an alternate implementation. Here's a summary:
## start out by executing a sample with current goTools code
library("goTools")
library("hgu133a.db")
data(probeID)
system.time(z0 <- ontoCompare(list(L1=affylist[[1]]), "hgu133a",
method="none"))
Starting ontoCompare...
    user   system  elapsed
1280.047   21.033 1320.269
## Now demonstrate alternate
system.time(zz <- goCompare(affylist[[1]], "hgu133a"))
   user  system elapsed
 14.712   0.116  15.154
Warning message:
In probeToGO(probes, probeType, ontology) :
removing 15 probe IDs with no mapping to GO
As you can see, the alternate is much faster (about 15 seconds elapsed versus about 1320). *However*, I haven't taken the time to completely re-implement the original function and, worse, I get slightly different results. You can use the following to compare:
zz[["Term"]] = sapply(zz$GO, function(x) Term(GOTERM[[x]]),
USE.NAMES=FALSE)
inboth <- intersect(rownames(z0), zz$Term)
zz[["OrigCount"]] <- as.integer(NA)
zz[match(inboth, zz$Term, nomatch=0L), "OrigCount"] <- as.integer(z0[inboth, ])
zz[, c("Ontology", "Term", "OrigCount", "Count")]
   Ontology                             Term OrigCount Count
1        MF               molecular_function         3    76
19       CC               cellular_component         2    76
34       BP               biological_process         5    75
12       CC                             cell        NA    74
13       CC                        cell part        74    74
2        MF                          binding        67    65
27       BP                 cellular process        58    58
21       CC                        organelle        45    45
36       BP                metabolic process        44    44
11       MF               catalytic activity        38    38
23       BP            biological regulation        12    31
40       BP regulation of biological process        29    29
15       CC                   organelle part        24    24
44       BP                     localization        13    21
[snip]
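To narrow the comparison to the terms that actually disagree, something like the following might help. It is only a sketch in base R: the toy `zz` below copies five rows from the output above so that it runs without goTools or GO.db, and the name `differs` is made up for illustration. With the real `zz` the same two last lines should apply.

```r
## Toy copy of five rows from the comparison table above
## (values are taken from the printed output, not recomputed).
zz <- data.frame(
  Ontology  = c("MF", "CC", "CC", "MF", "BP"),
  Term      = c("molecular_function", "cell", "cell part",
                "binding", "cellular process"),
  OrigCount = c(3L, NA, 74L, 67L, 58L),
  Count     = c(76L, 74L, 74L, 65L, 58L),
  stringsAsFactors = FALSE
)

## Flag rows where the original function reported no count for the term,
## or where the two counts disagree. is.na() comes first so that the NA
## row evaluates to TRUE rather than NA.
differs <- is.na(zz$OrigCount) | zz$OrigCount != zz$Count

## Show only the rows that need explaining
zz[differs, c("Ontology", "Term", "OrigCount", "Count")]
```

In the toy data this keeps molecular_function, cell, and binding and drops the two rows whose counts agree.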
I'm hoping that the attached code provides enough of a starting point for the package maintainer or other motivated party to work up a complete solution and understand the differences in the results.
+ seth
--
Seth Falcon
Program in Computational Biology | Fred Hutchinson Cancer Research Center