[Bioc-sig-seq] write.XStringSet() terribly slow

Thu May 6 10:13:06 CEST 2010

Hi,

Yes, that is much faster. Thank you. I will use that for now and hope for
Hervé's faster implementation of write.XStringSet() in the future.
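
For reference, the interleaving approach quoted below can be wrapped in a
small function. This is only a minimal sketch: the helper name
write_fasta_lines() is hypothetical (it is not part of Biostrings), it takes
plain character vectors so it runs without any package loaded, and it assumes
every sequence is short enough to fit on a single FASTA line.

```r
# Hypothetical helper (not part of Biostrings): build all FASTA lines in
# memory by interleaving headers and sequences, then write them in one call.
# Assumes each sequence fits on one line (e.g. 25 bp reads).
write_fasta_lines <- function(ids, seqs, file) {
    stopifnot(length(ids) == length(seqs))
    fasta <- character(2 * length(seqs))
    fasta[c(TRUE, FALSE)] <- paste(">", ids, sep = "")  # odd lines: headers
    fasta[c(FALSE, TRUE)] <- seqs                       # even lines: sequences
    writeLines(fasta, file)
    invisible(file)
}

# Toy data standing in for names(dna) and as.character(dna)
ids  <- c("read1", "read2")
seqs <- c("ACGTACGTACGTACGTACGTACGTA", "TTTTGGGGCCCCAAAATTTTGGGGC")
fl <- tempfile(fileext = ".fasta")
write_fasta_lines(ids, seqs, fl)
```

With a named DNAStringSet 'dna', the call would presumably be
write_fasta_lines(names(dna), as.character(dna), fl).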

Best,
Hans-Ulrich

Martin Morgan wrote:
> On 05/05/2010 02:35 PM, Martin Morgan wrote:
>    
>> On 05/05/2010 02:09 PM, Hervé Pagès wrote:
>>      
>>> Hans-Ulrich Klein wrote:
>>>        
>>>> Hi,
>>>>
>>>> I have the same problem. I want to write ~4 million short (25 bp)
>>>> sequences into one FASTA file. write.XStringSet() is very slow. Also,
>>>> writeFASTA() is very slow. Only about 1500 sequences are written per
>>>> minute.
>>>>          
>> if 'dna' is a DNAStringSet with names, and for this case where reads
>> are < 80 characters, then maybe
>>
>>    fasta  = paste(paste(">", names(dna), sep=""),
>>                   as.character(dna), sep="\n", collapse="\n")
>>    fl = tempfile()
>>    writeLines(fasta, fl)
>>      
> or probably better
>
>    fasta = character(2 * length(dna))
>    fasta[c(TRUE, FALSE)] = paste(">", names(dna), sep="")
>    fasta[c(FALSE, TRUE)] = as.character(dna)
>    writeLines(fasta, fl)
>
> Martin
>
>    
>> Martin
>>
>>      
>>> OK, I guess it's time to bite the bullet as they say.
>>>
>>> It has been on my TODO list for a long time to implement
>>> write.XStringSet() in C so I will work on this and let you
>>> know when it's ready.
>>>
>>> Cheers,
>>> H.
>>>
>>>        
>>>> Are there any alternatives?
>>>>
>>>> Best wishes,
>>>> Hans-Ulrich
>>>>
>>>>
>>>>   >  sessionInfo()
>>>> R version 2.11.0 RC (2010-04-19 r51778)
>>>> x86_64-pc-linux-gnu
>>>>
>>>> locale:
>>>> [1] C
>>>>
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>
>>>> other attached packages:
>>>> [1] ShortRead_1.6.2     Rsamtools_1.0.1     lattice_0.18-5
>>>> [4] Biostrings_2.16.0   GenomicRanges_1.0.1 IRanges_1.6.0
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] Biobase_2.8.0 grid_2.11.0   hwriter_1.2   tools_2.11.0
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Steffen Neumann wrote:
>>>>          
>>>>> Hi,
>>>>>
>>>>> I have some major performance problems writing fasta files
>>>>> with Biostrings. I have the full Arabidopsis Chr1 (30MByte) in one
>>>>> DNAString,
>>>>> and writing that to a file takes ages, as you see from the strace output
>>>>> below: I obtain ~5 lines (80 chars each) per second. The runtime
>>>>> of the system call <in brackets> is negligible.
>>>>>
>>>>> library(Biostrings)
>>>>> chromosome <- read.DNAStringSet("Chr1_TAIR9.fasta", "fasta")
>>>>> write.XStringSet(chromosome, file="/tmp/test.fasta", format="fasta")
>>>>>
>>>>> Is there a fundamental flaw in my thinking?
>>>>> Is there an alternative to write.XStringSet()?
>>>>> This happens both on my laptop and a beefy server.
>>>>>
>>>>> I also tried the (ancient) IRanges_1.0.16 and Biostrings_2.10.22,
>>>>> and get ~11 lines per second.
>>>>>
>>>>> Yours,
>>>>> Steffen
>>>>>
>>>>> 13:06:09.949290 write(4, "TAGGAGTTGATGAAGACATCTAACGAAAATTC"..., 80) = 80 <0.000137>
>>>>> 13:06:10.138835 write(4, "GTGCTCAGGCTTCATTGATAAGGAAAGAAACA"..., 80) = 80 <0.000142>
>>>>> 13:06:10.328395 write(4, "AAAGCAGAAACCGACGTGAAATATTACAGAGA"..., 80) = 80 <0.000133>
>>>>> 13:06:10.537475 write(4, "AGACTACTCGAGAATCATTGCACTGAAGAAAG"..., 80) = 80 <0.000159>
>>>>> 13:06:10.727281 write(4, "AAGTGAAAAGAGAAAGAGAATGTGTGATGTGT"..., 80) = 80 <0.000133>
>>>>> 13:06:10.916854 write(4, "CTTTGCTTTAAATGCAATCAGCTTCACGAGAA"..., 80) = 80 <0.000136>
>>>>> 13:06:11.105687 write(4, "GATTCAAGCTCGTTTCGCTCGCTCCGGGTGAA"..., 80) = 80 <0.000594>
>>>>>
>>>>> sessionInfo()
>>>>> R version 2.10.0 (2009-10-26)
>>>>> x86_64-unknown-linux-gnu
>>>>>
>>>>> locale:
>>>>>    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>>    [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>>    [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>>>>    [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>>>    [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>
>>>>> attached base packages:
>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>
>>>>> other attached packages:
>>>>> [1] Biostrings_2.14.12 IRanges_1.4.16
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] Biobase_2.6.0
>>>>>
>>>>>
>>>>>            
>>>>
>>>>          
>>>        
>>
>>      
>
>    

-- 
Hans-Ulrich Klein
Department of Medical Informatics and Biomathematics
University of Münster
Domagkstrasse 9
48149 Münster, Germany
Tel.: +49 (0)251 83-58405