[Bioc-sig-seq] write.XStringSet() terribly slow

Kasper Daniel Hansen kasperdanielhansen at gmail.com
Fri Apr 16 17:12:00 CEST 2010


It seems the version of writeFASTA in Biostrings is slightly broken
(the argument checking is not great and it chokes on various use
cases), but it works for something like dumping a whole BSgenome, for
example:

writeFASTA(x = Scerevisiae,
  desc = paste("chr", seq_along(seqnames(Scerevisiae)), sep = ""),
  file = "bsgenome_scerevisiae.fa")

It looks like write.XStringSet as a whole should be rethought a bit.
I'll look into this, hopefully today (unless someone else beats me to
it).
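
The strace output in the original report suggests the time is spent
between the write() calls rather than inside them. One way to avoid
per-line overhead at the R level is to cut the sequence into lines in a
single vectorized step and hand them all to writeLines() at once. Below
is a minimal base R sketch of that idea, assuming the sequence fits in
memory as one character string. The name write_fasta_sketch is
hypothetical: it is not part of Biostrings and it is not the patch
mentioned below.

```r
# Hypothetical helper (not Biostrings API): write one sequence as FASTA,
# wrapping it into fixed-width lines without an R-level loop.
write_fasta_sketch <- function(dna, desc, file, width = 80L) {
  stopifnot(is.character(dna), length(dna) == 1L, width >= 1L)
  n <- nchar(dna)
  # Start and end positions of every output line, computed in one go.
  starts <- if (n > 0L) seq.int(1L, n, by = width) else integer(0)
  ends <- pmin(starts + width - 1L, n)
  # substring() is vectorized over the positions and returns all lines.
  lines <- substring(dna, starts, ends)
  # One writeLines() call for the header plus all sequence lines.
  writeLines(c(paste0(">", desc), lines), con = file)
  invisible(length(lines))
}

# Small self-contained usage example with a 200-letter toy sequence.
toy <- paste(rep("ACGT", 50), collapse = "")
out_file <- tempfile(fileext = ".fa")
write_fasta_sketch(toy, "chrTest", out_file)
```

For a DNAString one would presumably pass as.character() of the
sequence as the first argument; I have not timed this against a full
chromosome, so treat it as an illustration of the approach rather than
a measured fix.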

Kasper

On Fri, Apr 16, 2010 at 9:55 AM, Kasper Daniel Hansen
<kasperdanielhansen at gmail.com> wrote:
> I don't know if there has been a refactoring of the code, but a while
> ago I sent a patch to writeFASTA making it orders of magnitude faster,
> so you should perhaps try that one.  The patch makes it pretty fast to
> dump entire BSgenomes into FASTA files.
>
> Kasper
>
> On Fri, Apr 16, 2010 at 9:17 AM, Steffen Neumann <sneumann at ipb-halle.de> wrote:
>> Hi,
>>
>> I have some major performance problems writing FASTA files
>> with Biostrings. I have the full Arabidopsis Chr1 (30 MB) in one DNAString,
>> and writing that to a file takes ages, as you can see from the strace
>> output below: I get ~5 lines (80 chars each) per second. The runtime
>> of each system call (shown in angle brackets) is negligible.
>>
>> library(Biostrings)
>> chromosome <- read.DNAStringSet("Chr1_TAIR9.fasta", "fasta")
>> write.XStringSet(chromosome, file="/tmp/test.fasta", format="fasta")
>>
>> Is there a fundamental flaw in my thinking?
>> Is there an alternative to write.XStringSet() ?
>> This happens both on my laptop and a beefy server.
>>
>> I also tried the (ancient) IRanges_1.0.16 and Biostrings_2.10.22,
>> and get ~11 lines per second.
>>
>> Yours,
>> Steffen
>>
>> 13:06:09.949290 write(4, "TAGGAGTTGATGAAGACATCTAACGAAAATTC"..., 80) = 80 <0.000137>
>> 13:06:10.138835 write(4, "GTGCTCAGGCTTCATTGATAAGGAAAGAAACA"..., 80) = 80 <0.000142>
>> 13:06:10.328395 write(4, "AAAGCAGAAACCGACGTGAAATATTACAGAGA"..., 80) = 80 <0.000133>
>> 13:06:10.537475 write(4, "AGACTACTCGAGAATCATTGCACTGAAGAAAG"..., 80) = 80 <0.000159>
>> 13:06:10.727281 write(4, "AAGTGAAAAGAGAAAGAGAATGTGTGATGTGT"..., 80) = 80 <0.000133>
>> 13:06:10.916854 write(4, "CTTTGCTTTAAATGCAATCAGCTTCACGAGAA"..., 80) = 80 <0.000136>
>> 13:06:11.105687 write(4, "GATTCAAGCTCGTTTCGCTCGCTCCGGGTGAA"..., 80) = 80 <0.000594>
>>
>> sessionInfo()
>> R version 2.10.0 (2009-10-26)
>> x86_64-unknown-linux-gnu
>>
>> locale:
>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] Biostrings_2.14.12 IRanges_1.4.16
>>
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.6.0
>>
>> --
>> IPB Halle                    AG Massenspektrometrie & Bioinformatik
>> Dr. Steffen Neumann          http://www.IPB-Halle.DE
>> Weinberg 3                   http://msbi.bic-gh.de
>> 06120 Halle                  Tel. +49 (0) 345 5582 - 1470
>>                                  +49 (0) 345 5582 - 0
>> sneumann(at)IPB-Halle.DE     Fax. +49 (0) 345 5582 - 1409
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>


