[R] Memoize and vectorize a custom function

Fri Apr 27 04:17:45 CEST 2012

On 04/26/2012 03:21 PM, Kamil Slowikowski wrote:
> My goal is simple: calcuate GC content of each sequence in a list of
> nucleotide
> sequences. I have figured out how to vectorize, but all my attempts at
> memoization failed.
>
> Can you show me how to properly memoize my function?
>
> There is a StackOverflow post on the subject of memoization, but it does not
> help me:
> http://stackoverflow.com/questions/7262485/options-for-caching-memoization-hashing-in-r
>
> I haven't been able to find any other discussions on this subject. Searching
> for "memoise" or "memoize" on r-bloggers.com returns zero results. Searching
> for those keywords at http://r-project.markmail.org/ does not return helpful
> discussions.
>
> Here's my data:
>
>      seqs<- c("","G","C","CCC","T","","TTCCT","","C","CTC")
>
> Some sequences are missing, so they're blank `""`.
>
> I have a function for calculating GC content:
>
>      >  GC<- function(s) {
>          if (!is.character(s)) return(NA)
>          n<- nchar(s)
>          if (n == 0) return(NA)
>          m<- gregexpr('[GCSgcs]', s)[[1]]
>          if (m[1]<  1) return(0)
>          return(100.0 * length(m) / n)
>      }
>
> It works:
>
>      >  GC('')
>      [1] NA
>      >  GC('G')
>      [1] 100
>      >  GC('GAG')
>      [1] 66.66667
>      >  sapply(seqs, GC)
>                        G         C       CCC         T               TTCCT
>                  C
>             NA 100.00000 100.00000 100.00000   0.00000        NA  40.00000
>       NA 100.00000
>            CTC
>       66.66667
>
> I want to memoize it. Then, I want to vectorize it. Should be easy, right?

Hi Kamil --

Not really an answer to your question, but looking at

   http://bioconductor.org/packages/2.10/bioc/html/Biostrings.html

will tell you to install Biostrings with

   source("http://bioconductor.org/biocLite.R")
   biocLite("Biostrings")

and then

   library(Biostrings)
   dna = DNAStringSet(c("","G","C","CCC","T","","TTCCT","","C","CTC"))
   alf = alphabetFrequency(dna, as.prob=TRUE, baseOnly=TRUE)
   rowSums(alf[,c("G", "C")])

will give you GC content of each string.

 > rowSums(alf[,c("G", "C")])
  [1]       NaN 1.0000000 1.0000000 1.0000000 0.0000000       NaN 0.4000000
  [8]       NaN 1.0000000 0.6666667

this will be fast and scalable; Biostrings and other Bioconductor 
(http://bioconductor.org) packages have many useful functions for 
working with DNA.

See the Bioconductor mailing list for more help if this is a promising 
direction.

   http://bioconductor.org/help/mailing-list/

Martin

>
> Apparently, I must have the wrong mindset for using the `memoise` or
> `R.cache`
> R packages:
>
>      >  system.time(dummy<- sapply(rep(seqs,100), GC))
>         user  system elapsed
>        0.044   0.000   0.054
>      >
>      >  library(memoise)
>      >  GCm1<- memoise(GC)
>      >  system.time(dummy<- sapply(rep(seqs,100), GCm1))
>         user  system elapsed
>        0.164   0.000   0.173
>      >
>      >  library(R.cache)
>      >  GCm2<- addMemoization(GC)
>      >  system.time(dummy<- sapply(rep(seqs,100), GCm2))
>         user  system elapsed
>       10.601   0.252  10.926
>
> Notice that the memoized functions are several orders of magnitude slower.
>
> I tried the `hash` package, but things seem to be happening behind the
> scenes
> and I don't understand the output:
>
>      >  cache<- hash()
>      >  GCc<- function(s) {
>          if (!is.character(s) || nchar(s) == 0) {
>              return(NA)
>          }
>          if(exists(s, cache)) {
>              return(cache[[s]])
>          }
>          result<- GC(s)
>          cache[[s]]<<- result
>          return(result)
>      }
>      >  sapply(seqs,GCc)
>      [[1]]
>      [1] NA
>
>      $G
>      [1] 100
>
>      $C
>      NULL
>
>      $CCC
>      [1] 100
>
>      $T
>      NULL
>
>      [[6]]
>      [1] NA
>
>      $TTCCT
>      [1] 40
>
>      [[8]]
>      [1] NA
>
>      $C
>      NULL
>
>      $CTC
>      [1] 66.66667
>
> At least I figured out how to vectorize:
>
>      >  GCv<- Vectorize(GC)
>      >  GCv(seqs)
>                        G         C       CCC         T               TTCCT
>                  C
>        0.00000 100.00000 100.00000 100.00000   0.00000   0.00000  40.00000
> 0.00000 100.00000
>            CTC
>       66.66667
>
> 	[[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793