[Bioc-sig-seq] write.XStringSet() terribly slow

Martin Morgan mtmorgan at fhcrc.org
Wed May 5 23:35:16 CEST 2010


On 05/05/2010 02:09 PM, Hervé Pagès wrote:
> Hans-Ulrich Klein wrote:
>> Hi,
>>
>> I have the same problem. I want to write ~4 million short (25 bp)
>> sequences into one fasta file. write.XStringSet() is very slow. Also,
>> writeFASTA() is very slow. Only about 1500 sequences are written per
>> minute.

If 'dna' is a DNAStringSet with names, and for this case where the reads
are < 80 characters (so no line wrapping is needed), then maybe

  # one ">name\nsequence" record per read, joined into a single string
  fasta  = paste(paste(">", names(dna), sep=""),
                 as.character(dna), sep="\n", collapse="\n")
  fl = tempfile()
  writeLines(fasta, fl)   # written in one call
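
A variant of the same idea, as a minimal sketch in base R only. It assumes
the sequences are already available as a named character vector (here the
made-up vector 'reads' stands in for as.character(dna)), and the helper
'wrap_seq' is a hypothetical name for the case where a non-empty sequence is
longer than 80 characters and needs wrapping. Untested on data of the sizes
discussed in this thread.

```r
# Hypothetical stand-in for as.character(dna): a named character vector.
reads <- c(read1 = "ACGTACGTACGTACGTACGTACGTA",
           read2 = "TTGACCATGGCATTGACCATGGCAT")

# Interleave header and sequence lines, one vector element per output line.
# rbind() builds a 2 x n matrix; as.vector() reads it column by column,
# giving header 1, sequence 1, header 2, sequence 2, ...
fasta_lines <- as.vector(rbind(paste0(">", names(reads)), unname(reads)))

fl <- tempfile(fileext = ".fasta")
writeLines(fasta_lines, fl)   # a single vectorised write

# Hypothetical helper: split one non-empty sequence into chunks of at most
# 'width' characters, for sequences too long to fit on one line.
wrap_seq <- function(s, width = 80L) {
  starts <- seq(1L, nchar(s), by = width)
  substring(s, starts, pmin(starts + width - 1L, nchar(s)))
}
```

readLines(fl) can be used to check the result on a small input before
trying it on millions of reads.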

Martin

> 
> OK, I guess it's time to bite the bullet as they say.
> 
> It has been on my TODO list for a long time to implement
> write.XStringSet() in C so I will work on this and let you
> know when it's ready.
> 
> Cheers,
> H.
> 
>>
>> Are there any alternatives?
>>
>> Best wishes,
>> Hans-Ulrich
>>
>>
>>  > sessionInfo()
>> R version 2.11.0 RC (2010-04-19 r51778)
>> x86_64-pc-linux-gnu
>>
>> locale:
>> [1] C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] ShortRead_1.6.2     Rsamtools_1.0.1     lattice_0.18-5
>> [4] Biostrings_2.16.0   GenomicRanges_1.0.1 IRanges_1.6.0
>>
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.8.0 grid_2.11.0   hwriter_1.2   tools_2.11.0
>>
>>
>>
>>
>>
>> Steffen Neumann wrote:
>>> Hi,
>>>
>>> I have some major performance problems writing fasta files
>>> with Biostrings. I have the full Arabidopsis Chr1 (30MByte) in one
>>> DNAString,
>>> and writing that to a file takes ages, as you see from the strace output
>>> below: I obtain ~5 lines (80 chars each) per second. The runtime
>>> of the system call (in angle brackets) is negligible.
>>>
>>> library(Biostrings)
>>> chromosome<-read.DNAStringSet("Chr1_TAIR9.fasta", "fasta")
>>> write.XStringSet(chromosome, file="/tmp/test.fasta", format="fasta")
>>>
>>> Is there a fundamental flaw in my thinking?
>>> Is there an alternative to write.XStringSet() ?
>>> This happens both on my laptop and a beefy server.
>>>
>>> I also tried the (ancient) IRanges_1.0.16 and Biostrings_2.10.22,
>>> and get ~11 lines per second.
>>>
>>> Yours,
>>> Steffen
>>>
>>> 13:06:09.949290 write(4, "TAGGAGTTGATGAAGACATCTAACGAAAATTC"..., 80) = 80<0.000137>
>>> 13:06:10.138835 write(4, "GTGCTCAGGCTTCATTGATAAGGAAAGAAACA"..., 80) = 80<0.000142>
>>> 13:06:10.328395 write(4, "AAAGCAGAAACCGACGTGAAATATTACAGAGA"..., 80) = 80<0.000133>
>>> 13:06:10.537475 write(4, "AGACTACTCGAGAATCATTGCACTGAAGAAAG"..., 80) = 80<0.000159>
>>> 13:06:10.727281 write(4, "AAGTGAAAAGAGAAAGAGAATGTGTGATGTGT"..., 80) = 80<0.000133>
>>> 13:06:10.916854 write(4, "CTTTGCTTTAAATGCAATCAGCTTCACGAGAA"..., 80) = 80<0.000136>
>>> 13:06:11.105687 write(4, "GATTCAAGCTCGTTTCGCTCGCTCCGGGTGAA"..., 80) = 80<0.000594>
>>>
>>> sessionInfo()
>>> R version 2.10.0 (2009-10-26)
>>> x86_64-unknown-linux-gnu
>>>
>>> locale:
>>>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>   [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>> [1] Biostrings_2.14.12 IRanges_1.4.16
>>>
>>> loaded via a namespace (and not attached):
>>> [1] Biobase_2.6.0
>>>
>>>    
>>
>>
> 


-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


