[Bioc-devel] writeVcf performance

Michael Lawrence michafla at gene.com
Tue Aug 26 19:47:56 CEST 2014


My understanding is that the heap optimization provided only marginal gains,
and that we need to think harder about how to optimize all of the string
manipulation in writeVcf. We either need to reduce it or reduce its
overhead (i.e., the CHARSXP allocation). Gabe is doing more tests.
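
For anyone who wants to check the scaling on their own machine, here is a
rough sketch (not code from writeVcf) that repeats the paste0 timing from
the original report at a few smaller sizes and reports the cost per string.
The sizes and the names sizes, elapsed, per_string and timings are arbitrary
choices for illustration; absolute timings will depend on the R version and
hardware.

```r
# Time paste0() on random integers at increasing sizes.
# Sizes are kept small so the sketch runs quickly; raise them to
# approach the 1e7 case from the original report.
set.seed(1000)
sizes <- c(1e4, 1e5, 1e6)

elapsed <- vapply(sizes, function(n) {
  end <- sample(1e8, n)
  # system.time() returns a named numeric vector; keep wall-clock time
  system.time(paste0("END", "=", end))[["elapsed"]]
}, numeric(1))

# Cost per string: roughly constant if the running time is linear in n
per_string <- elapsed / sizes
timings <- data.frame(n = sizes, elapsed = elapsed, per_string = per_string)
print(timings)
```

If the cost were linear, per_string would stay roughly flat as n grows; an
upward trend would be consistent with the behavior described in the quoted
message below.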


On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain <vobencha at fhcrc.org>
wrote:

> Hi Gabe,
>
> Martin responded, and so did Michael,
>
> https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html
>
> It sounded like Michael was ok with working with/around heap
> initialization.
>
> Michael, is that right or should we still consider this on the table?
>
>
> Val
>
>
> On 08/26/2014 09:34 AM, Gabe Becker wrote:
>
>> Val,
>>
>> Has there been any movement on this? This remains a substantial
>> bottleneck for us when writing very large VCF files (e.g.
>> variants+genotypes for whole genome NGS samples).
>>
>> I was able to see a ~25% speedup with 4 cores and an "optimal" speedup
>> of ~2x with 10-12 cores for a VCF with 500k rows using a very naive
>> parallelization strategy and no other changes. I suspect this could be
>> improved on quite a bit, or possibly made irrelevant with judicious use
>> of serial C code.
>>
>> Did you and Martin make any plans regarding optimizing writeVcf?
>>
>> Best
>> ~G
>>
>>
>> On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain <vobencha at fhcrc.org>
>> wrote:
>>
>>     Hi Michael,
>>
>>     I'm interested in working on this. I'll discuss with Martin next
>>     week when we're both back in the office.
>>
>>     Val
>>
>>
>>
>>
>>
>>     On 08/05/14 07:46, Michael Lawrence wrote:
>>
>>         Hi guys (Val, Martin, Herve):
>>
>>         Anyone have an itch for optimization? The writeVcf function is
>>         currently a bottleneck in our WGS genotyping pipeline. For a
>>         typical 50 million row gVCF, it was taking 2.25 hours prior to
>>         yesterday's improvements (pasteCollapseRows) that brought it
>>         down to about 1 hour, which is still too long by my standards
>>         (> 0). Only takes 3 minutes to call the genotypes (and
>>         associated likelihoods etc) from the variant calls (using 80
>>         cores and 450 GB RAM on one node), so the output is an issue.
>>         Profiling suggests that the running time scales non-linearly in
>>         the number of rows.
>>
>>         Digging a little deeper, it seems to be something with R's
>>         string/memory allocation. Below, pasting 1 million strings
>>         takes 6 seconds, but 10 million strings takes over 2 minutes.
>>         It gets way worse with 50 million. I suspect it has something
>>         to do with R's string hash table.
>>
>>         set.seed(1000)
>>         end <- sample(1e8, 1e6)
>>         system.time(paste0("END", "=", end))
>>              user  system elapsed
>>             6.396   0.028   6.420
>>
>>         end <- sample(1e8, 1e7)
>>         system.time(paste0("END", "=", end))
>>              user  system elapsed
>>         134.714   0.352 134.978
>>
>>         Indeed, even this takes a long time (in a fresh session):
>>
>>         set.seed(1000)
>>         end <- sample(1e8, 1e6)
>>         end <- sample(1e8, 1e7)
>>         system.time(as.character(end))
>>              user  system elapsed
>>            57.224   0.156  57.366
>>
>>         But running it a second time is faster (about what one would
>>         expect?):
>>
>>         system.time(levels <- as.character(end))
>>              user  system elapsed
>>            23.582   0.021  23.589
>>
>>         I did some simple profiling of R to find that the resizing of
>>         the string hash table is not a significant component of the
>>         time. So maybe something to do with the R heap/gc? No time
>>         right now to go deeper. But I know Martin likes this sort of
>>         thing ;)
>>
>>         Michael
>>
>>         _________________________________________________
>>         Bioc-devel at r-project.org mailing list
>>         https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
>> --
>> Computational Biologist
>> Genentech Research
>>
>
>
>
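
On the naive parallelization Gabe mentions in the quoted thread: the sketch
below shows one simple way such a strategy could look, splitting the input
into contiguous chunks and pasting each chunk in a forked worker. It is not
necessarily what Gabe ran and it is not code from writeVcf. The helper name
paste_in_chunks and its arguments are hypothetical; splitIndices() and
mclapply() are from the parallel package that ships with R.

```r
library(parallel)

# Hypothetical helper: paste a prefix onto x in chunks, one chunk per task.
paste_in_chunks <- function(x, prefix = "END=", n_chunks = 4L, cores = 1L) {
  # splitIndices() divides 1..length(x) into contiguous index blocks
  idx <- splitIndices(length(x), n_chunks)
  pieces <- mclapply(idx, function(i) paste0(prefix, x[i]), mc.cores = cores)
  # Chunks are contiguous and in order, so concatenating restores x's order
  unlist(pieces, use.names = FALSE)
}

set.seed(1000)
end <- sample(1e8, 1e5)

# mclapply() relies on forking, so more than one core is only
# supported on non-Windows platforms
cores <- if (.Platform$OS.type == "windows") 1L else 2L
res <- paste_in_chunks(end, n_chunks = 4L, cores = cores)
```

One caveat with this approach: each worker's result is serialized and read
back into the parent session, so the parent still has to allocate every
string. That may be part of why the gain reported in the thread is well
short of linear in the number of cores, though this has not been measured
here.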
