[R] Memoize and vectorize a custom function
Henrik Bengtsson
hb at biostat.ucsf.edu
Fri Apr 27 18:15:01 CEST 2012
On Thu, Apr 26, 2012 at 3:21 PM, Kamil Slowikowski
<kslowikowski at gmail.com> wrote:
> My goal is simple: calcuate GC content of each sequence in a list of
> nucleotide
> sequences. I have figured out how to vectorize, but all my attempts at
> memoization failed.
>
> Can you show me how to properly memoize my function?
>
> There is a StackOverflow post on the subject of memoization, but it does not
> help me:
> http://stackoverflow.com/questions/7262485/options-for-caching-memoization-hashing-in-r
>
> I haven't been able to find any other discussions on this subject. Searching
> for "memoise" or "memoize" on r-bloggers.com returns zero results. Searching
> for those keywords at http://r-project.markmail.org/ does not return helpful
> discussions.
>
> Here's my data:
>
> seqs <- c("","G","C","CCC","T","","TTCCT","","C","CTC")
>
> Some sequences are missing, so they're blank `""`.
>
> I have a function for calculating GC content:
>
> > GC <- function(s) {
> if (!is.character(s)) return(NA)
> n <- nchar(s)
> if (n == 0) return(NA)
> m <- gregexpr('[GCSgcs]', s)[[1]]
> if (m[1] < 1) return(0)
> return(100.0 * length(m) / n)
> }
>
> It works:
>
> > GC('')
> [1] NA
> > GC('G')
> [1] 100
> > GC('GAG')
> [1] 66.66667
> > sapply(seqs, GC)
> G C CCC T TTCCT
> C
> NA 100.00000 100.00000 100.00000 0.00000 NA 40.00000
> NA 100.00000
> CTC
> 66.66667
>
> I want to memoize it. Then, I want to vectorize it. Should be easy, right?
>
> Apparently, I must have the wrong mindset for using the `memoise` or
> `R.cache`
> R packages:
>
> > system.time(dummy <- sapply(rep(seqs,100), GC))
> user system elapsed
> 0.044 0.000 0.054
> >
> > library(memoise)
> > GCm1 <- memoise(GC)
> > system.time(dummy <- sapply(rep(seqs,100), GCm1))
> user system elapsed
> 0.164 0.000 0.173
> >
> > library(R.cache)
> > GCm2 <- addMemoization(GC)
> > system.time(dummy <- sapply(rep(seqs,100), GCm2))
> user system elapsed
> 10.601 0.252 10.926
>
> Notice that the memoized functions are several orders of magnitude slower.
About R.cache: All memoization by R.cache is currently done toward the
file system. In other words, it is designed for larger objects (so
you cannot hold all of the cache in memory) and more computationally
expensive tasks.
/Henrik
>
> I tried the `hash` package, but things seem to be happening behind the
> scenes
> and I don't understand the output:
>
> > cache <- hash()
> > GCc <- function(s) {
> if (!is.character(s) || nchar(s) == 0) {
> return(NA)
> }
> if(exists(s, cache)) {
> return(cache[[s]])
> }
> result <- GC(s)
> cache[[s]] <<- result
> return(result)
> }
> > sapply(seqs,GCc)
> [[1]]
> [1] NA
>
> $G
> [1] 100
>
> $C
> NULL
>
> $CCC
> [1] 100
>
> $T
> NULL
>
> [[6]]
> [1] NA
>
> $TTCCT
> [1] 40
>
> [[8]]
> [1] NA
>
> $C
> NULL
>
> $CTC
> [1] 66.66667
>
> At least I figured out how to vectorize:
>
> > GCv <- Vectorize(GC)
> > GCv(seqs)
> G C CCC T TTCCT
> C
> 0.00000 100.00000 100.00000 100.00000 0.00000 0.00000 40.00000
> 0.00000 100.00000
> CTC
> 66.66667
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
More information about the R-help
mailing list