[Bioc-devel] writeVcf performance
Kasper Daniel Hansen
kasperdanielhansen at gmail.com
Fri Aug 29 19:42:37 CEST 2014
Try to run it through the lineprof package for memory profiling; I have
found this to be very helpful.
Here is an old blog post I wrote about it:
http://www.hansenlab.org/rstats/2014/01/30/lineprof/
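
For anyone who has not used it, here is a minimal sketch of what that looks
like. It assumes lineprof is installed (it is distributed on GitHub rather
than CRAN), and paste_end() is a made-up stand-in for the string-building
step, not code taken from writeVcf:

```r
# Hypothetical stand-in for the string work being profiled; this is
# NOT the actual writeVcf code, just the paste0() pattern from the thread.
paste_end <- function(n) {
  end <- sample(1e8, n)
  paste0("END", "=", end)
}

# Only profile if lineprof is available, so the script still runs without it.
if (requireNamespace("lineprof", quietly = TRUE)) {
  # lineprof() evaluates the expression and records time and memory
  # allocated/released, broken down by source line where srcrefs exist.
  prof <- lineprof::lineprof(paste_end(1e5))
  print(prof)
  # lineprof::shine(prof)  # interactive browser view; uncomment to explore
} else {
  message("lineprof is not installed; skipping the profiling step")
}
```

The per-line breakdown is most useful when the function being profiled was
loaded with source() so that source references are kept.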
Kasper
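
On the streaming idea in Gabe's message quoted below (write rows straight to
the file instead of building the whole text representation in memory first),
a rough R-level approximation is to format and write in row chunks. This is
only a sketch under assumptions: write_in_chunks() and its arguments are
hypothetical, it works on a plain data.frame rather than a VCF object, and it
is not how writeVcf is implemented.

```r
# Hypothetical helper (not part of VariantAnnotation): format and write a
# data.frame in fixed-size row chunks, so only one chunk's worth of
# character data exists at any time.
write_in_chunks <- function(df, path, chunk_size = 1e5) {
  con <- file(path, open = "wt")
  on.exit(close(con))
  n <- nrow(df)
  # Guard against empty input, where seq(1, 0, by = ...) would fail.
  starts <- if (n > 0L) seq(1L, n, by = chunk_size) else integer(0)
  for (s in starts) {
    e <- min(s + chunk_size - 1L, n)
    chunk <- df[s:e, , drop = FALSE]
    # Paste the columns of this chunk together, tab-separated, one line per row.
    lines <- do.call(paste, c(unname(as.list(chunk)), sep = "\t"))
    writeLines(lines, con)
  }
  invisible(n)
}

# Usage on a tiny made-up table:
df <- data.frame(chrom = c("chr1", "chr1", "chr2"),
                 pos = c(10L, 20L, 30L),
                 info = paste0("END=", c(15L, 25L, 35L)),
                 stringsAsFactors = FALSE)
tmp <- tempfile()
write_in_chunks(df, tmp, chunk_size = 2)
```

This still does the pasting in R, so it caps peak memory rather than removing
the string allocation cost; doing the formatting in C, as suggested in the
thread, is a separate step.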
On Wed, Aug 27, 2014 at 2:56 PM, Gabe Becker <becker.gabe at gene.com> wrote:
> The profiling I attached in my previous email is for 24 geno fields, as I
> said, but our typical use case involves only ~4-6 fields, and is faster but
> still on the order of dozens of minutes.
>
> Sorry for the confusion.
> ~G
>
>
> On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker <beckerg4 at gene.com> wrote:
>
> > Martin and Val.
> >
> > I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields)
> > with profiling enabled. The results of summaryRprof for that run are
> > attached, though for a variety of reasons they are pretty misleading.
> >
> > It took over an hour to write (3700+ seconds), so it's definitely a
> > bottleneck when the data get very large, even if it isn't for smaller
> > data.
> >
> > Michael and I both think the culprit is all the pasting and cbinding that
> > is going on, and more to the point, that memory for an internal
> > representation to be written out is allocated at all. Streaming across
> > the object, looping by rows and writing directly to file (e.g. from C)
> > should be blisteringly fast in comparison.
> >
> > ~G
> >
> >
> > On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence <michafla at gene.com>
> > wrote:
> >
> >> Gabe is still testing/profiling, but we'll send something randomized
> >> along eventually.
> >>
> >>
> >> On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan <mtmorgan at fhcrc.org>
> >> wrote:
> >>
> >>> I didn't see in the original thread a reproducible (simulated, I guess)
> >>> example, to be explicit about what the problem is??
> >>>
> >>> Martin
> >>>
> >>>
> >>> On 08/26/2014 10:47 AM, Michael Lawrence wrote:
> >>>
> >>>> My understanding is that the heap optimization provided marginal gains,
> >>>> and that we need to think harder about how to optimize all of the string
> >>>> manipulation in writeVcf. We either need to reduce it or reduce its
> >>>> overhead (i.e., the CHARSXP allocation). Gabe is doing more tests.
> >>>>
> >>>>
> >>>> On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain <vobencha at fhcrc.org>
> >>>> wrote:
> >>>>
> >>>>> Hi Gabe,
> >>>>>
> >>>>> Martin responded, and so did Michael,
> >>>>>
> >>>>> https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html
> >>>>>
> >>>>> It sounded like Michael was ok with working with/around heap
> >>>>> initialization.
> >>>>>
> >>>>> Michael, is that right or should we still consider this on the table?
> >>>>>
> >>>>>
> >>>>> Val
> >>>>>
> >>>>>
> >>>>> On 08/26/2014 09:34 AM, Gabe Becker wrote:
> >>>>>
> >>>>>> Val,
> >>>>>>
> >>>>>> Has there been any movement on this? This remains a substantial
> >>>>>> bottleneck for us when writing very large VCF files (e.g.
> >>>>>> variants+genotypes for whole genome NGS samples).
> >>>>>>
> >>>>>> I was able to see a ~25% speedup with 4 cores and an "optimal" speedup
> >>>>>> of ~2x with 10-12 cores for a VCF with 500k rows using a very naive
> >>>>>> parallelization strategy and no other changes. I suspect this could be
> >>>>>> improved on quite a bit, or possibly made irrelevant with judicious use
> >>>>>> of serial C code.
> >>>>>>
> >>>>>> Did you and Martin make any plans regarding optimizing writeVcf?
> >>>>>>
> >>>>>> Best
> >>>>>> ~G
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain
> >>>>>> <vobencha at fhcrc.org> wrote:
> >>>>>>
> >>>>>> Hi Michael,
> >>>>>>
> >>>>>> I'm interested in working on this. I'll discuss with Martin next
> >>>>>> week when we're both back in the office.
> >>>>>>
> >>>>>> Val
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 08/05/14 07:46, Michael Lawrence wrote:
> >>>>>>
> >>>>>> Hi guys (Val, Martin, Herve):
> >>>>>>
> >>>>>> Anyone have an itch for optimization? The writeVcf function is
> >>>>>> currently a bottleneck in our WGS genotyping pipeline. For a typical 50
> >>>>>> million row gVCF, it was taking 2.25 hours prior to yesterday's
> >>>>>> improvements (pasteCollapseRows) that brought it down to about 1 hour,
> >>>>>> which is still too long by my standards (> 0). Only takes 3 minutes to
> >>>>>> call the genotypes (and associated likelihoods etc) from the variant
> >>>>>> calls (using 80 cores and 450 GB RAM on one node), so the output is an
> >>>>>> issue. Profiling suggests that the running time scales non-linearly in
> >>>>>> the number of rows.
> >>>>>>
> >>>>>> Digging a little deeper, it seems to be something with R's
> >>>>>> string/memory allocation. Below, pasting 1 million strings takes 6
> >>>>>> seconds, but 10 million strings takes over 2 minutes. It gets way worse
> >>>>>> with 50 million. I suspect it has something to do with R's string hash
> >>>>>> table.
> >>>>>>
> >>>>>>     set.seed(1000)
> >>>>>>     end <- sample(1e8, 1e6)
> >>>>>>     system.time(paste0("END", "=", end))
> >>>>>>        user  system elapsed
> >>>>>>       6.396   0.028   6.420
> >>>>>>
> >>>>>>     end <- sample(1e8, 1e7)
> >>>>>>     system.time(paste0("END", "=", end))
> >>>>>>        user  system elapsed
> >>>>>>     134.714   0.352 134.978
> >>>>>>
> >>>>>> Indeed, even this takes a long time (in a fresh session):
> >>>>>>
> >>>>>>     set.seed(1000)
> >>>>>>     end <- sample(1e8, 1e6)
> >>>>>>     end <- sample(1e8, 1e7)
> >>>>>>     system.time(as.character(end))
> >>>>>>        user  system elapsed
> >>>>>>      57.224   0.156  57.366
> >>>>>>
> >>>>>> But running it a second time is faster (about what one would expect?):
> >>>>>>
> >>>>>>     system.time(levels <- as.character(end))
> >>>>>>        user  system elapsed
> >>>>>>      23.582   0.021  23.589
> >>>>>>
> >>>>>> I did some simple profiling of R to find that the resizing of the
> >>>>>> string hash table is not a significant component of the time. So maybe
> >>>>>> something to do with the R heap/gc? No time right now to go deeper. But
> >>>>>> I know Martin likes this sort of thing ;)
> >>>>>>
> >>>>>> Michael
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> Bioc-devel at r-project.org mailing list
> >>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> >>>>>>
> >>>>>> --
> >>>>>> Computational Biologist
> >>>>>> Genentech Research
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>>
> >>> --
> >>> Computational Biology / Fred Hutchinson Cancer Research Center
> >>> 1100 Fairview Ave. N.
> >>> PO Box 19024 Seattle, WA 98109
> >>>
> >>> Location: Arnold Building M1 B861
> >>> Phone: (206) 667-2793
> >>>
> >>
> >>
> >
> >
> > --
> > Computational Biologist
> > Genentech Research
> >
>
>
>
> --
> Computational Biologist
> Genentech Research
>