[Bioc-sig-seq] write.XStringSet() terribly slow

Hans-Ulrich Klein h.klein at uni-muenster.de
Wed May 5 15:37:56 CEST 2010


Hi,

I have the same problem. I want to write ~4 million short (25 bp) 
sequences into one FASTA file. write.XStringSet() is very slow. 
writeFASTA() is also very slow: only about 1500 sequences are written per minute.

Are there any alternatives?
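
One possible workaround, as a minimal sketch only: convert the set to a
plain character vector and write header and sequence lines in a single
call to base R's writeLines(). The helper name write_fasta_simple() is
hypothetical. The sketch assumes short reads that need no line wrapping
and that the whole character vector fits in memory. Whether it is
actually faster than write.XStringSet() would need to be measured.

```r
# Hypothetical helper (not part of Biostrings or ShortRead): writes one
# header line and one sequence line per record, without line wrapping.
write_fasta_simple <- function(seqs, ids, path) {
  stopifnot(length(seqs) == length(ids))
  lines <- character(2L * length(seqs))
  lines[c(TRUE, FALSE)] <- paste0(">", ids)  # odd positions: headers
  lines[c(FALSE, TRUE)] <- seqs              # even positions: sequences
  writeLines(lines, path)                    # single vectorised write
}

# Small self-contained example with made-up reads.
tmp <- tempfile(fileext = ".fasta")
write_fasta_simple(c("ACGT", "TTGA"), c("read1", "read2"), tmp)
```

For a DNAStringSet x, the inputs could presumably be obtained with
as.character(x) and names(x), assuming the names are set.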

Best wishes,
Hans-Ulrich


 > sessionInfo()
R version 2.11.0 RC (2010-04-19 r51778)
x86_64-pc-linux-gnu

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] ShortRead_1.6.2     Rsamtools_1.0.1     lattice_0.18-5
[4] Biostrings_2.16.0   GenomicRanges_1.0.1 IRanges_1.6.0

loaded via a namespace (and not attached):
[1] Biobase_2.8.0 grid_2.11.0   hwriter_1.2   tools_2.11.0





Steffen Neumann wrote:
> Hi,
>
> I have some major performance problems writing fasta files
> with Biostrings. I have the full Arabidopsis Chr1 (30MByte) in one DNAString,
> and writing that to a file takes ages, as you see from the strace output
> below: I obtain ~5 lines (80 chars each) per second. The runtime
> of each system call (shown in angle brackets) is negligible.
>
> library(Biostrings)
> chromosome<-read.DNAStringSet("Chr1_TAIR9.fasta", "fasta")
> write.XStringSet(chromosome, file="/tmp/test.fasta", format="fasta")
>
> Is there a fundamental flaw in my thinking ?
> Is there an alternative to write.XStringSet() ?
> This happens both on my laptop and a beefy server.
>
> I also tried the (ancient) IRanges_1.0.16 and Biostrings_2.10.22,
> and get ~11 lines per second.
>
> Yours,
> Steffen
>
> 13:06:09.949290 write(4, "TAGGAGTTGATGAAGACATCTAACGAAAATTC"..., 80) = 80<0.000137>
> 13:06:10.138835 write(4, "GTGCTCAGGCTTCATTGATAAGGAAAGAAACA"..., 80) = 80<0.000142>
> 13:06:10.328395 write(4, "AAAGCAGAAACCGACGTGAAATATTACAGAGA"..., 80) = 80<0.000133>
> 13:06:10.537475 write(4, "AGACTACTCGAGAATCATTGCACTGAAGAAAG"..., 80) = 80<0.000159>
> 13:06:10.727281 write(4, "AAGTGAAAAGAGAAAGAGAATGTGTGATGTGT"..., 80) = 80<0.000133>
> 13:06:10.916854 write(4, "CTTTGCTTTAAATGCAATCAGCTTCACGAGAA"..., 80) = 80<0.000136>
> 13:06:11.105687 write(4, "GATTCAAGCTCGTTTCGCTCGCTCCGGGTGAA"..., 80) = 80<0.000594>
>
> sessionInfo()
> R version 2.10.0 (2009-10-26)
> x86_64-unknown-linux-gnu
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] Biostrings_2.14.12 IRanges_1.4.16
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.6.0
>
>    


-- 
Hans-Ulrich Klein
Department of Medical Informatics and Biomathematics
University of Münster
Domagkstrasse 9
48149 Münster, Germany
Tel.: +49 (0)251 83-58405


