[Bioc-devel] writeVcf performance
Valerie Obenchain
vobencha at fhcrc.org
Tue Aug 26 18:43:04 CEST 2014
Hi Gabe,
Martin responded, and so did Michael,
https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html
It sounded like Michael was ok with working with/around heap
initialization.
Michael, is that right, or should we still consider this to be on the table?
Val
On 08/26/2014 09:34 AM, Gabe Becker wrote:
> Val,
>
> Has there been any movement on this? This remains a substantial
> bottleneck for us when writing very large VCF files (e.g.
> variants+genotypes for whole genome NGS samples).
>
> I was able to see a ~25% speedup with 4 cores and an "optimal" speedup
> of ~2x with 10-12 cores for a VCF with 500k rows using a very naive
> parallelization strategy and no other changes. I suspect this could be
> improved on quite a bit, or possibly made irrelevant with judicious use
> of serial C code.
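>
> A rough, hypothetical sketch of that kind of naive chunk-level strategy
> (made-up column layout and chunk count; mclapply forking assumes a
> non-Windows machine):
>
> library(parallel)
>
> ## Stand-in for the fixed columns of a 500k-row VCF.
> n <- 5e5
> fixed <- data.frame(CHROM = "1", POS = seq_len(n), ID = ".",
>                     REF = "A", ALT = "T", QUAL = ".",
>                     FILTER = ".", INFO = paste0("END=", seq_len(n)))
>
> ## Paste each chunk of rows into tab-delimited lines in parallel,
> ## then write serially (header omitted in this sketch).
> chunks <- split(seq_len(n), cut(seq_len(n), 4, labels = FALSE))
> body <- unlist(mclapply(chunks, function(i)
>     do.call(paste, c(fixed[i, ], sep = "\t")), mc.cores = 4))
> writeLines(body, "chunk-test.vcf")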
>
> Did you and Martin make any plans regarding optimizing writeVcf?
>
> Best
> ~G
>
>
> On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain <vobencha at fhcrc.org> wrote:
>
> Hi Michael,
>
> I'm interested in working on this. I'll discuss with Martin next
> week when we're both back in the office.
>
> Val
>
> On 08/05/14 07:46, Michael Lawrence wrote:
>
> Hi guys (Val, Martin, Herve):
>
> Anyone have an itch for optimization? The writeVcf function is currently
> a bottleneck in our WGS genotyping pipeline. For a typical 50 million row
> gVCF, it was taking 2.25 hours prior to yesterday's improvements
> (pasteCollapseRows) that brought it down to about 1 hour, which is still
> too long by my standards (> 0). Only takes 3 minutes to call the genotypes
> (and associated likelihoods etc) from the variant calls (using 80 cores
> and 450 GB RAM on one node), so the output is an issue. Profiling suggests
> that the running time scales non-linearly in the number of rows.
>
> Digging a little deeper, it seems to be something with R's string/memory
> allocation. Below, pasting 1 million strings takes 6 seconds, but 10
> million strings takes over 2 minutes. It gets way worse with 50 million.
> I suspect it has something to do with R's string hash table.
>
> set.seed(1000)
> end <- sample(1e8, 1e6)
> system.time(paste0("END", "=", end))
>     user  system elapsed
>    6.396   0.028   6.420
>
> end <- sample(1e8, 1e7)
> system.time(paste0("END", "=", end))
>     user  system elapsed
>  134.714   0.352 134.978
>
> Indeed, even this takes a long time (in a fresh session):
>
> set.seed(1000)
> end <- sample(1e8, 1e6)
> end <- sample(1e8, 1e7)
> system.time(as.character(end))
>     user  system elapsed
>   57.224   0.156  57.366
>
> But running it a second time is faster (about what one would expect?):
>
> system.time(levels <- as.character(end))
>     user  system elapsed
>   23.582   0.021  23.589
>
> I did some simple profiling of R to find that the resizing of the string
> hash table is not a significant component of the time. So maybe something
> to do with the R heap/gc? No time right now to go deeper. But I know
> Martin likes this sort of thing ;)
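>
> A quick, purely illustrative way to poke at the gc theory (the heap sizes
> below are arbitrary guesses, not recommendations):
>
> gcinfo(TRUE)                  # report each garbage collection as it runs
> end <- sample(1e8, 1e7)
> invisible(paste0("END", "=", end))
> gcinfo(FALSE)
>
> ## ...then repeat in a session started with a larger initial heap, e.g.
> ##   R --vanilla --min-nsize=100M --min-vsize=8G
> ## and compare timings and the number of collections reported.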
>
> Michael
>
> --
> Computational Biologist
> Genentech Research