[Bioc-sig-seq] Optimizing the generation of reads w/Biostrings.

Tue Mar 16 18:52:11 CET 2010

Hi All, I have been trying (rather futilely) to optimize the performance
of some simulations which I am trying to do. Briefly, I have a set of
reads which are sampled from a genome. These reads are stored in a
DNAStringSet. The next thing I want to do is apply an error model to
'mutate' the reads.

Below, are three implementations for the same procedure. The first is very
slow and doesn't leverage vectorization. The second is faster, leverages
vectorization, but is done in a way that is counter-intuitive (one thing
to note is that in practice the substitution matrices might be cycle
dependent, hence why I walk over the columns). The third way seemed good,
but there is so much conversion that for 14,000,000 reads I use > 20 Gigs
of memory? Naively, that many reads should be about 1 Gig.

I am looking for low-level Biostrings functionality. something like
applyString, or applyBase. I cannot fathom why this should take so long or
consume so much memory.

thanks, jim

## the read-set (note that I want to do this for 1e7 reads)
BASES <- c("A", "C", "G", "T")
reads <- DNAStringSet(sapply(1:2000, function(i) {
  paste(sample(BASES, replace = T, size = 75),
        collapse = "")
}))

SubM <- structure(c(0.99, 0.0033, 0.0033, 0.0033, 0.0033, 0.99,
                    0.0033,  0.0033, 0.0033, 0.0033, 0.99, 0.0033,
                    0.0033, 0.0033, 0.0033,  0.99),
                  .Dim = c(4L, 4L), .Dimnames = list(BASES, BASES))

f.naive <- function(ds) {
  DNAStringSet(sapply(strsplit(as.character(ds), ""), function(read) {
    paste(sapply(read, function(b) sample(BASES, prob = SubM[b,], size =
1)), collapse = "")
  }))
}

f.lessSo <- function(ds) {
  cVec <- c('A', 'C', 'G', 'T', 'A', 'C', 'G', 'T',
            'A', 'C', 'G', 'T', 'A', 'C', 'G', 'T')

  V <- do.call(rbind, strsplit(as.character(ds), split = ""))
  V <- apply(do.call(cbind, lapply(1:ncol(V), function(i) {
    sample(cVec, prob = as.numeric(SM[[i]]), size = nrow(V), replace = TRUE)
  })), 1, paste, collapse = "")

  DNAStringSet(V)
}

f.withViews <- function(ds) {
  cVec <- c('A', 'C', 'G', 'T', 'A', 'C', 'G', 'T',
            'A', 'C', 'G', 'T', 'A', 'C', 'G', 'T')

  V <- do.call(rbind, strsplit(as.character(ds), split = ""))
  V <- DNAString(paste(as.character(t(do.call(cbind, lapply(1:ncol(V),
function(i) {
    sample(cVec, prob = as.numeric(SM[[i]]), size = nrow(V), replace = TRUE)
  })))), collapse = ""))
  DNAStringSet(Views(V, start = seq(1, length(V), by = 75),
                     end = seq(75, length(V), by = 75)))
}

system.time(r <- f.naive(reads))
###    user  system elapsed
###   2.820   0.000   2.822

system.time(r <- f.lessSo(reads))
###    user  system elapsed
###   0.190   0.000   0.189

system.time(r <- f.withViews(reads))
###    user  system elapsed
###   0.090   0.010   0.097