[Bioc-devel] writeVcf performance
Valerie Obenchain
vobencha at fhcrc.org
Tue Aug 26 18:43:04 CEST 2014
Hi Gabe,
Martin responded, and so did Michael,
https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html
It sounded like Michael was ok with working with/around heap
initialization.
Michael, is that right, or should we still consider this to be on the table?
Val
On 08/26/2014 09:34 AM, Gabe Becker wrote:
> Val,
>
> Has there been any movement on this? This remains a substantial
> bottleneck for us when writing very large VCF files (e.g.
> variants+genotypes for whole genome NGS samples).
>
> I was able to see a ~25% speedup with 4 cores and an "optimal" speedup
> of ~2x with 10-12 cores for a VCF with 500k rows using a very naive
> parallelization strategy and no other changes. I suspect this could be
> improved on quite a bit, or possibly made irrelevant with judicious use
> of serial C code.
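>
> A rough, hypothetical sketch of that kind of naive chunk-level strategy
> (made-up column layout and chunk count; mclapply forking assumes a
> non-Windows machine):
>
> library(parallel)
>
> ## Stand-in for the fixed columns of a 500k-row VCF.
> n <- 5e5
> fixed <- data.frame(CHROM = "1", POS = seq_len(n), ID = ".",
>                     REF = "A", ALT = "T", QUAL = ".",
>                     FILTER = ".", INFO = paste0("END=", seq_len(n)))
>
> ## Paste each chunk of rows into tab-delimited lines in parallel,
> ## then write serially (header omitted in this sketch).
> chunks <- split(seq_len(n), cut(seq_len(n), 4, labels = FALSE))
> body <- unlist(mclapply(chunks, function(i)
>     do.call(paste, c(fixed[i, ], sep = "\t")), mc.cores = 4))
> writeLines(body, "chunk-test.vcf")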
>
> Did you and Martin make any plans regarding optimizing writeVcf?
>
> Best
> ~G
>
>
> On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain <vobencha at fhcrc.org> wrote:
>
> Hi Michael,
>
> I'm interested in working on this. I'll discuss with Martin next
> week when we're both back in the office.
>
> Val
>
> On 08/05/14 07:46, Michael Lawrence wrote:
>
> Hi guys (Val, Martin, Herve):
>
> Anyone have an itch for optimization? The writeVcf function is currently
> a bottleneck in our WGS genotyping pipeline. For a typical 50 million row
> gVCF, it was taking 2.25 hours prior to yesterday's improvements
> (pasteCollapseRows) that brought it down to about 1 hour, which is still
> too long by my standards (> 0). Only takes 3 minutes to call the genotypes
> (and associated likelihoods etc) from the variant calls (using 80 cores
> and 450 GB RAM on one node), so the output is an issue. Profiling suggests
> that the running time scales non-linearly in the number of rows.
>
> Digging a little deeper, it seems to be something with R's string/memory
> allocation. Below, pasting 1 million strings takes 6 seconds, but 10
> million strings takes over 2 minutes. It gets way worse with 50 million.
> I suspect it has something to do with R's string hash table.
>
> set.seed(1000)
> end <- sample(1e8, 1e6)
> system.time(paste0("END", "=", end))
>     user  system elapsed
>    6.396   0.028   6.420
>
> end <- sample(1e8, 1e7)
> system.time(paste0("END", "=", end))
>     user  system elapsed
>  134.714   0.352 134.978
>
> Indeed, even this takes a long time (in a fresh session):
>
> set.seed(1000)
> end <- sample(1e8, 1e6)
> end <- sample(1e8, 1e7)
> system.time(as.character(end))
>     user  system elapsed
>   57.224   0.156  57.366
>
> But running it a second time is faster (about what one would expect?):
>
> system.time(levels <- as.character(end))
>     user  system elapsed
>   23.582   0.021  23.589
>
> I did some simple profiling of R to find that the resizing of the string
> hash table is not a significant component of the time. So maybe something
> to do with the R heap/gc? No time right now to go deeper. But I know
> Martin likes this sort of thing ;)
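>
> A quick, purely illustrative way to poke at the gc theory (the heap sizes
> below are arbitrary guesses, not recommendations):
>
> gcinfo(TRUE)                  # report each garbage collection as it runs
> end <- sample(1e8, 1e7)
> invisible(paste0("END", "=", end))
> gcinfo(FALSE)
>
> ## ...then repeat in a session started with a larger initial heap, e.g.
> ##   R --vanilla --min-nsize=100M --min-vsize=8G
> ## and compare timings and the number of collections reported.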
>
> Michael
>
> --
> Computational Biologist
> Genentech Research