[BioC] write.XStringView/write.XStringSet highly inefficient
Martin Morgan
mtmorgan at fhcrc.org
Tue Jul 27 15:26:42 CEST 2010
On 07/27/2010 04:56 AM, Michael Dondrup wrote:
> Hi,
>
> I was trying to use write.XStringView on a larger dataset but to no avail. It seems like it is not implemented
> efficiently. What I am trying is:
>
> I downloaded http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/chr1.fa.gz
>
>> library(Biostrings)
>> dnasts <- read.DNAStringSet(file="chr1.fa")
Hi Michael --
This is also
library(BSgenome.Hsapiens.UCSC.hg18)
Hsapiens
chr1 = unmasked(Hsapiens[["chr1"]])
> # break up the fasta file into segments of size 60
>> dnaviews <- Views(dnasts[[1]], start = seq(1, length(dnasts[[1]]), 60), width=60)
... and
dnaviews <-
Views(chr1, successiveIRanges(rep(60, ceiling(length(chr1) / 60))))
>> write.XStringViews(dnaviews, file="out.fa")
> system.time(write.XStringSet(as(dnaviews, "DNAStringSet"),
file=tempfile()))
user system elapsed
7.024 0.756 8.030
but this is with
> sessionInfo()
R version 2.12.0 Under development (unstable) (2010-07-20 r52579)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] BSgenome.Hsapiens.UCSC.hg18_1.3.16 BSgenome_1.17.6
[3] Biostrings_2.17.24 GenomicRanges_1.1.17
[5] IRanges_1.7.12
loaded via a namespace (and not attached):
[1] Biobase_2.9.0 tools_2.12.0
> ... I interrupted the process after 1h reaching a memory peak of over 3GB.
> In principle doing the whole task should not take longer than a few seconds. I found this report:
> https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2010-April/001160.html
> I guess that is the same problem? Has there been any progress?
so yes, there is progress but it requires use of the 'devel' version of
R and Bioconductor.
There were a couple of other posts in that thread
fasta = character(2 * length(dna))
fasta[c(TRUE, FALSE)] = paste(">", names(dna), sep="")
fasta[c(FALSE, TRUE)] = as.character(dna)
writeLines(fasta, fl)
and the more complete patch that seemed not to make it to the mailing
list directly but that is in
http://www.mail-archive.com/bioc-sig-sequencing@r-project.org/msg01135.html
I wonder what you're going to do with your fasta file now?
Hope that helps,
Martin
>
> Is there probably a more efficient way of implementing this, e.g. using cat()?
>
> Thanks a lot
> Michael
>
>> sessionInfo()
> R version 2.11.1 (2010-05-31)
> x86_64-unknown-linux-gnu
>
> locale:
> [1] C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] Biostrings_2.16.9 IRanges_1.6.8
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.8.0
>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
More information about the Bioconductor
mailing list