[Bioc-devel] serializing pairwise alignment objects

Hahne, Florian florian.hahne at novartis.com
Wed Nov 7 08:41:19 CET 2012


Great Herve,
thanks a lot!
Florian

On 11/7/12 3:06 AM, "Hervé Pagès" <hpages at fhcrc.org> wrote:

>Hi Florian,
>
>I just removed the 'substitutionArray' slot from PairwiseAlignments
>objects in Biostrings 2.27.7. The slot didn't seem to be used/needed
>by any downstream method.
>
>   > packageVersion("Biostrings")
>   [1] ‘2.27.7’
>   > x <- "xxxabcdefghijklmnopqyyy"
>   > y <- "abcdhijkzzzzlmnpqr"
>   > pa <- pairwiseAlignment(x, y)
>   > slotNames(pa)
>   [1] "pattern"      "subject"      "type"         "score"        "gapOpening"
>   [6] "gapExtension"
>   > validObject(pa)
>   [1] TRUE
>   > object.size(pa)
>   35528 bytes
>
>... instead of 35308996 bytes! 3 orders of magnitude smaller :-)
>
>Cheers,
>H.
>
>
>On 11/05/2012 03:45 AM, Hahne, Florian wrote:
>> Indeed. I did not look that far into the implementation, it just
>> seemed odd to me that the objects got that inflated. scoreOnly is not
>> really that helpful if you want to deal with the actual alignments.
>> The only reasonable application I see for it is if you want to rank a
>> bunch of sequences by pairwise similarity. This gigantic memory
>> footprint is really breaking things once you start doing a lot of
>> these pairwise alignment operations in parallel. mclapply complains
>> about not being able to turn such large objects into a raw vector,
>> and serializing to disk quickly fills your hard drive. You also lose
>> a lot of the time gained by parallel processing just by writing and
>> loading gigabytes of data...
>> I don't know enough about the internals of the PairwiseAlignments
>> classes, but it seems that there must be a way to avoid having this
>> huge array as part of the object. As a quick and dirty fix for now I
>> just replaced the substitutionArray slot with an empty matrix and all
>> the downstream operations that I wanted to do still work. Would be
>> great if you could take a look into this, Herve.
>> Thanks,
>> Florian
>>
>
>-- 
>Hervé Pagès
>
>Program in Computational Biology
>Division of Public Health Sciences
>Fred Hutchinson Cancer Research Center
>1100 Fairview Ave. N, M1-B514
>P.O. Box 19024
>Seattle, WA 98109-1024
>
>E-mail: hpages at fhcrc.org
>Phone:  (206) 667-5791
>Fax:    (206) 667-1319
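
The effect discussed in this thread (a large auxiliary slot inflating both object.size() and the raw vector produced by serialization) can be reproduced without Biostrings. The following is only a minimal sketch in base R: the class ToyAlignment and its slot bigArray are hypothetical stand-ins, not the real PairwiseAlignments definition, and the array dimensions are arbitrary.

```r
library(methods)

# Hypothetical toy class (an assumption for illustration, NOT Biostrings code):
# one small payload slot plus one large auxiliary array slot.
setClass("ToyAlignment",
         representation(score = "numeric", bigArray = "array"))

obj <- new("ToyAlignment",
           score = 42,
           bigArray = array(0, dim = c(100, 100, 100)))

# In-memory size, and length of the raw vector that serialize() produces
size_before <- as.numeric(object.size(obj))
raw_before  <- length(serialize(obj, NULL))

# Swap the large slot for an empty array, analogous to the quick fix
# described in the quoted message; validObject() signals an error if
# the object is no longer valid.
obj@bigArray <- array(numeric(0), dim = c(0, 0, 0))
validObject(obj)

size_after <- as.numeric(object.size(obj))
raw_after  <- length(serialize(obj, NULL))

print(c(before = size_before, after = size_after))
print(c(before = raw_before,  after = raw_after))
```

Comparing the two printed pairs shows how much of the footprint comes from the single slot. Whether emptying a slot is safe for a real class depends on which downstream methods read it.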


