[Bioc-sig-seq] write.XStringSet() terribly slow

Kasper Daniel Hansen kasperdanielhansen at gmail.com
Fri Apr 16 15:55:24 CEST 2010


I don't know if the code has been refactored since, but a while ago I
sent a patch to writeFASTA that made it orders of magnitude faster, so
you should perhaps try that one.  The patch makes it pretty fast to dump
entire BSgenomes into fasta files.
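
To illustrate the general idea only (this is not the patch itself, and it
uses base R rather than Biostrings internals): a minimal sketch that wraps
one long sequence into fixed-width lines with a single vectorized
substring() call and writes them with one writeLines() call, instead of
issuing one write per line. The function name write_fasta_sketch and its
arguments are hypothetical.

```r
# Hypothetical helper, NOT the writeFASTA patch mentioned above.
# Assumes `sequence` is a single character string holding the whole sequence.
write_fasta_sketch <- function(sequence, name, file, width = 80L) {
  header <- paste0(">", name)
  n <- nchar(sequence)
  if (n == 0L) {
    # Empty sequence: write only the header line.
    writeLines(header, con = file)
    return(invisible(0L))
  }
  # Start and end positions of every output line, computed in one go.
  starts <- seq.int(1L, n, by = width)
  ends <- pmin(starts + width - 1L, n)
  # One vectorized call produces all lines.
  lines <- substring(sequence, starts, ends)
  # One call writes the header plus all sequence lines.
  writeLines(c(header, lines), con = file)
  invisible(length(lines))
}

# Small usage example on a 200-character toy sequence.
toy <- paste(rep("ACGT", 50), collapse = "")
out_file <- tempfile(fileext = ".fasta")
n_lines <- write_fasta_sketch(toy, "toy_chr", out_file)
```

With Biostrings loaded, I would assume the input string can be obtained via
as.character(chromosome[[1]]), but I have not timed this on a full
chromosome.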

Kasper

On Fri, Apr 16, 2010 at 9:17 AM, Steffen Neumann <sneumann at ipb-halle.de> wrote:
> Hi,
>
> I have some major performance problems writing fasta files
> with Biostrings. I have the full Arabidopsis Chr1 (30MByte) in one DNAString,
> and writing that to a file takes ages, as you see from the strace output
> below: I obtain ~5 lines (80 chars each) per second. The runtime
> of the system call <in brackets> is negligible.
>
> library(Biostrings)
> chromosome <- read.DNAStringSet("Chr1_TAIR9.fasta", "fasta")
> write.XStringSet(chromosome, file="/tmp/test.fasta", format="fasta")
>
> Is there a fundamental flaw in my thinking?
> Is there an alternative to write.XStringSet()?
> This happens both on my laptop and a beefy server.
>
> I also tried the (ancient) IRanges_1.0.16 and Biostrings_2.10.22,
> and get ~11 lines per second.
>
> Yours,
> Steffen
>
> 13:06:09.949290 write(4, "TAGGAGTTGATGAAGACATCTAACGAAAATTC"..., 80) = 80 <0.000137>
> 13:06:10.138835 write(4, "GTGCTCAGGCTTCATTGATAAGGAAAGAAACA"..., 80) = 80 <0.000142>
> 13:06:10.328395 write(4, "AAAGCAGAAACCGACGTGAAATATTACAGAGA"..., 80) = 80 <0.000133>
> 13:06:10.537475 write(4, "AGACTACTCGAGAATCATTGCACTGAAGAAAG"..., 80) = 80 <0.000159>
> 13:06:10.727281 write(4, "AAGTGAAAAGAGAAAGAGAATGTGTGATGTGT"..., 80) = 80 <0.000133>
> 13:06:10.916854 write(4, "CTTTGCTTTAAATGCAATCAGCTTCACGAGAA"..., 80) = 80 <0.000136>
> 13:06:11.105687 write(4, "GATTCAAGCTCGTTTCGCTCGCTCCGGGTGAA"..., 80) = 80 <0.000594>
>
> sessionInfo()
> R version 2.10.0 (2009-10-26)
> x86_64-unknown-linux-gnu
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] Biostrings_2.14.12 IRanges_1.4.16
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.6.0
>
> --
> IPB Halle                    AG Massenspektrometrie & Bioinformatik
> Dr. Steffen Neumann          http://www.IPB-Halle.DE
> Weinberg 3                   http://msbi.bic-gh.de
> 06120 Halle                  Tel. +49 (0) 345 5582 - 1470
>                                  +49 (0) 345 5582 - 0
> sneumann(at)IPB-Halle.DE     Fax. +49 (0) 345 5582 - 1409
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>


