[BioC] write.XStringView/write.XStringSet highly inefficient

Tue Jul 27 15:26:42 CEST 2010

On 07/27/2010 04:56 AM, Michael Dondrup wrote:
> Hi,
> 
> I was trying to use write.XStringView on a larger dataset but to no avail. It seems like it is not implemented 
> efficiently. What I am trying is:
> 
> I downloaded http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/chr1.fa.gz
> 
>> library(Biostrings)
>> dnasts <- read.DNAStringSet(file="chr1.fa")

Hi Michael --

This is also

library(BSgenome.Hsapiens.UCSC.hg18)
Hsapiens
chr1 = unmasked(Hsapiens[["chr1"]])

> # break up the fasta file into segments of size 60
>> dnaviews <- Views(dnasts[[1]], start = seq(1, length(dnasts[[1]]), 60), width=60)

... and

dnaviews <-
    Views(chr1, successiveIRanges(rep(60, ceiling(length(chr1) / 60))))

>> write.XStringViews(dnaviews, file="out.fa")

> system.time(write.XStringSet(as(dnaviews, "DNAStringSet"),
                               file=tempfile()))
   user  system elapsed
  7.024   0.756   8.030

but this is with

> sessionInfo()
R version 2.12.0 Under development (unstable) (2010-07-20 r52579)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] BSgenome.Hsapiens.UCSC.hg18_1.3.16 BSgenome_1.17.6
[3] Biostrings_2.17.24                 GenomicRanges_1.1.17
[5] IRanges_1.7.12

loaded via a namespace (and not attached):
[1] Biobase_2.9.0 tools_2.12.0

> ... I interrupted the process after 1h reaching a memory peak of over 3GB.
> In principle doing the whole task should not take longer than a few seconds. I found this report:
> https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2010-April/001160.html
> I guess that is the same problem? Has there been any progress?

so yes, there is progress but it requires use of the 'devel' version of
R and Bioconductor.

There were a couple of other posts in that thread

  fasta = character(2 * length(dna))
  fasta[c(TRUE, FALSE)] = paste(">", names(dna), sep="")
  fasta[c(FALSE, TRUE)] = as.character(dna)
  writeLines(fasta, fl)

and the more complete patch that seemed not to make it to the mailing
list directly but that is in
http://www.mail-archive.com/bioc-sig-sequencing@r-project.org/msg01135.html

I wonder what you're going to do with your fasta file now?

Hope that helps,

Martin

> 
>  Is there probably a more efficient way of implementing this, e.g. using cat()?
> 
> Thanks a lot
> Michael
> 
>> sessionInfo()
> R version 2.11.1 (2010-05-31) 
> x86_64-unknown-linux-gnu 
> 
> locale:
> [1] C
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base     
> 
> other attached packages:
> [1] Biostrings_2.16.9 IRanges_1.6.8    
> 
> loaded via a namespace (and not attached):
> [1] Biobase_2.8.0
>>
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793