[R] Memoize and vectorize a custom function
Martin Morgan
mtmorgan at fhcrc.org
Fri Apr 27 04:17:45 CEST 2012
On 04/26/2012 03:21 PM, Kamil Slowikowski wrote:
> My goal is simple: calcuate GC content of each sequence in a list of
> nucleotide
> sequences. I have figured out how to vectorize, but all my attempts at
> memoization failed.
>
> Can you show me how to properly memoize my function?
>
> There is a StackOverflow post on the subject of memoization, but it does not
> help me:
> http://stackoverflow.com/questions/7262485/options-for-caching-memoization-hashing-in-r
>
> I haven't been able to find any other discussions on this subject. Searching
> for "memoise" or "memoize" on r-bloggers.com returns zero results. Searching
> for those keywords at http://r-project.markmail.org/ does not return helpful
> discussions.
>
> Here's my data:
>
> seqs<- c("","G","C","CCC","T","","TTCCT","","C","CTC")
>
> Some sequences are missing, so they're blank `""`.
>
> I have a function for calculating GC content:
>
> > GC<- function(s) {
> if (!is.character(s)) return(NA)
> n<- nchar(s)
> if (n == 0) return(NA)
> m<- gregexpr('[GCSgcs]', s)[[1]]
> if (m[1]< 1) return(0)
> return(100.0 * length(m) / n)
> }
>
> It works:
>
> > GC('')
> [1] NA
> > GC('G')
> [1] 100
> > GC('GAG')
> [1] 66.66667
> > sapply(seqs, GC)
> G C CCC T TTCCT
> C
> NA 100.00000 100.00000 100.00000 0.00000 NA 40.00000
> NA 100.00000
> CTC
> 66.66667
>
> I want to memoize it. Then, I want to vectorize it. Should be easy, right?
Hi Kamil --
Not really an answer to your question, but looking at
http://bioconductor.org/packages/2.10/bioc/html/Biostrings.html
will tell you to install Biostrings with
source("http://bioconductor.org/biocLite.R")
biocLite("Biostrings")
and then
library(Biostrings)
dna = DNAStringSet(c("","G","C","CCC","T","","TTCCT","","C","CTC"))
alf = alphabetFrequency(dna, as.prob=TRUE, baseOnly=TRUE)
rowSums(alf[,c("G", "C")])
will give you GC content of each string.
> rowSums(alf[,c("G", "C")])
[1] NaN 1.0000000 1.0000000 1.0000000 0.0000000 NaN 0.4000000
[8] NaN 1.0000000 0.6666667
this will be fast and scalable; Biostrings and other Bioconductor
(http://bioconductor.org) packages have many useful functions for
working with DNA.
See the Bioconductor mailing list for more help if this is a promising
direction.
http://bioconductor.org/help/mailing-list/
Martin
>
> Apparently, I must have the wrong mindset for using the `memoise` or
> `R.cache`
> R packages:
>
> > system.time(dummy<- sapply(rep(seqs,100), GC))
> user system elapsed
> 0.044 0.000 0.054
> >
> > library(memoise)
> > GCm1<- memoise(GC)
> > system.time(dummy<- sapply(rep(seqs,100), GCm1))
> user system elapsed
> 0.164 0.000 0.173
> >
> > library(R.cache)
> > GCm2<- addMemoization(GC)
> > system.time(dummy<- sapply(rep(seqs,100), GCm2))
> user system elapsed
> 10.601 0.252 10.926
>
> Notice that the memoized functions are several orders of magnitude slower.
>
> I tried the `hash` package, but things seem to be happening behind the
> scenes
> and I don't understand the output:
>
> > cache<- hash()
> > GCc<- function(s) {
> if (!is.character(s) || nchar(s) == 0) {
> return(NA)
> }
> if(exists(s, cache)) {
> return(cache[[s]])
> }
> result<- GC(s)
> cache[[s]]<<- result
> return(result)
> }
> > sapply(seqs,GCc)
> [[1]]
> [1] NA
>
> $G
> [1] 100
>
> $C
> NULL
>
> $CCC
> [1] 100
>
> $T
> NULL
>
> [[6]]
> [1] NA
>
> $TTCCT
> [1] 40
>
> [[8]]
> [1] NA
>
> $C
> NULL
>
> $CTC
> [1] 66.66667
>
> At least I figured out how to vectorize:
>
> > GCv<- Vectorize(GC)
> > GCv(seqs)
> G C CCC T TTCCT
> C
> 0.00000 100.00000 100.00000 100.00000 0.00000 0.00000 40.00000
> 0.00000 100.00000
> CTC
> 66.66667
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793
More information about the R-help
mailing list