[BioC] write.XStringView/write.XStringSet highly inefficient (solved)

Wed Jul 28 11:10:24 CEST 2010

Thank you Martin,

the fact that write.XStringSet has been improved so much is really an important feature. So I am very glad to hear this.
I got the idea from here: 
http://biostar.stackexchange.com/questions/1853/code-golf-digesting-fasta-sequences-into-a-set-of-smaller-sequences
However, I  am using this function to write out real-world data and was startled by the fact that it could break
would the data become larger, hence my question.

I also tested some methods in pure-R mentioned by you, maybe this is useful to someone until the next version of Bioconductor is released:

1.
> http://www.mail-archive.com/bioc-sig-sequencing@r-project.org/msg01135.html

This solution was in fact too slow for my purpose, possibly because it uses for-loops and type conversions.

2. this solution was the fastest one in my case:

writeFASTA <- function(dna,  desc=names(dna),  file=stdout()) {
  if (is.null(desc))
    desc <- paste(seq(along=dna))
  fasta = character(2 * length(dna))
  fasta[c(TRUE, FALSE)] <- paste(">", desc, sep="")
  fasta[c(FALSE, TRUE)] <- as.character(dna)
  writeLines(fasta, file)
}

The downside of this function is it does not wrap long sequence lines. So I came up with another one:

3.
writeFASTA <- function(sequences, desc=names(sequences), width=80, file=stdout(), append=FALSE) {
  sequences <- as.character(sequences)
  if(!is.null(desc) && length(sequences) != length(desc))
     stop("wrong length of 'desc'")
  desc <- if (is.null(desc))
    paste(">", seq(along=sequences))
  else
    paste(">", desc, "\n", sep="") 
  ## Adjust line width or add a \n 
  sequences <-  if (! is.null(width))
    gsub( sprintf("(.{1,%d})", width), "\\1\n", sequences )
  else
    paste(sequences, "\n", sep="")
  cat(paste(desc,sequences, sep="", collapse=""), sep="", file=file, append=append )
}

Not totally fool-proof, but  still much faster than solution 1.

Best
Michael

Am Jul 27, 2010 um 3:26 PM schrieb Martin Morgan:

> On 07/27/2010 04:56 AM, Michael Dondrup wrote:
>> Hi,
>> 
>> I was trying to use write.XStringView on a larger dataset but to no avail. It seems like it is not implemented 
>> efficiently. What I am trying is:
>> 
>> I downloaded http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/chr1.fa.gz
>> 
>>> library(Biostrings)
>>> dnasts <- read.DNAStringSet(file="chr1.fa")
> 
> Hi Michael --
> 
> This is also
> 
> library(BSgenome.Hsapiens.UCSC.hg18)
> Hsapiens
> chr1 = unmasked(Hsapiens[["chr1"]])
> 
>> # break up the fasta file into segments of size 60
>>> dnaviews <- Views(dnasts[[1]], start = seq(1, length(dnasts[[1]]), 60), width=60)
> 
> ... and
> 
> dnaviews <-
>    Views(chr1, successiveIRanges(rep(60, ceiling(length(chr1) / 60))))
> 
>>> write.XStringViews(dnaviews, file="out.fa")
> 
>> system.time(write.XStringSet(as(dnaviews, "DNAStringSet"),
>                               file=tempfile()))
>   user  system elapsed
>  7.024   0.756   8.030
> 
> 
> but this is with
> 
>> sessionInfo()
> R version 2.12.0 Under development (unstable) (2010-07-20 r52579)
> Platform: x86_64-unknown-linux-gnu (64-bit)
> 
> locale:
> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
> [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
> [1] BSgenome.Hsapiens.UCSC.hg18_1.3.16 BSgenome_1.17.6
> [3] Biostrings_2.17.24                 GenomicRanges_1.1.17
> [5] IRanges_1.7.12
> 
> loaded via a namespace (and not attached):
> [1] Biobase_2.9.0 tools_2.12.0
> 
> 
>> ... I interrupted the process after 1h reaching a memory peak of over 3GB.
>> In principle doing the whole task should not take longer than a few seconds. I found this report:
>> https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2010-April/001160.html
>> I guess that is the same problem? Has there been any progress?
> 
> so yes, there is progress but it requires use of the 'devel' version of
> R and Bioconductor.
> 
> There were a couple of other posts in that thread
> 
>  fasta = character(2 * length(dna))
>  fasta[c(TRUE, FALSE)] = paste(">", names(dna), sep="")
>  fasta[c(FALSE, TRUE)] = as.character(dna)
>  writeLines(fasta, fl)
> 
> and the more complete patch that seemed not to make it to the mailing
> list directly but that is in
> http://www.mail-archive.com/bioc-sig-sequencing@r-project.org/msg01135.html
> 
> I wonder what you're going to do with your fasta file now?
> 
> Hope that helps,
> 
> Martin
> 
>> 
>> Is there probably a more efficient way of implementing this, e.g. using cat()?
>> 
>> Thanks a lot
>> Michael
>> 
>>> sessionInfo()
>> R version 2.11.1 (2010-05-31) 
>> x86_64-unknown-linux-gnu 
>> 
>> locale:
>> [1] C
>> 
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base     
>> 
>> other attached packages:
>> [1] Biostrings_2.16.9 IRanges_1.6.8    
>> 
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.8.0
>>> 
>> 
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> 
> -- 
> Martin Morgan
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
> 
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793