[Bioc-devel] writeVcf performance

Kasper Daniel Hansen kasperdanielhansen at gmail.com
Fri Aug 29 19:42:37 CEST 2014


Try to run it through the lineprof package for memory profiling; I have
found this to be very helpful.

Here is an old blog post I wrote about it
  http://www.hansenlab.org/rstats/2014/01/30/lineprof/

Kasper


On Wed, Aug 27, 2014 at 2:56 PM, Gabe Becker <becker.gabe at gene.com> wrote:

> The profiling I attached in my previous email is for 24 geno fields, as I
> said, but our typical usecase involves only ~4-6 fields, and is faster but
> still on the order of dozens of minutes.
>
> Sorry for the confusion.
> ~G
>
>
> On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker <beckerg4 at gene.com> wrote:
>
> > Martin and Val.
> >
> > I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields)
> > with profiling enabled. The results of summaryRprof for that run are
> > attached, though for a variety of reasons they are pretty misleading.
> >
> > It took over an hour to write (3700+seconds), so it's definitely a
> > bottleneck when the data get very large, even if it isn't for smaller
> data.
> >
> > Michael and I both think the culprit is all the pasting and cbinding that
> > is going on, and more to the point, that memory for an internal
> > representation to be written out is allocated at all.  Streaming across
> the
> > object, looping by rows and writing directly to file (e.g. from C) should
> > be blisteringly fast in comparison.
> >
> > ~G
> >
> >
> > On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence <michafla at gene.com>
> > wrote:
> >
> >> Gabe is still testing/profiling, but we'll send something randomized
> >> along eventually.
> >>
> >>
> >> On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan <mtmorgan at fhcrc.org>
> >> wrote:
> >>
> >>> I didn't see in the original thread a reproducible (simulated, I guess)
> >>> example, to be explicit about what the problem is??
> >>>
> >>> Martin
> >>>
> >>>
> >>> On 08/26/2014 10:47 AM, Michael Lawrence wrote:
> >>>
> >>>> My understanding is that the heap optimization provided marginal
> gains,
> >>>> and
> >>>> that we need to think harder about how to optimize the all of the
> string
> >>>> manipulation in writeVcf. We either need to reduce it or reduce its
> >>>> overhead (i.e., the CHARSXP allocation). Gabe is doing more tests.
> >>>>
> >>>>
> >>>> On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain <
> vobencha at fhcrc.org>
> >>>> wrote:
> >>>>
> >>>>  Hi Gabe,
> >>>>>
> >>>>> Martin responded, and so did Michael,
> >>>>>
> >>>>> https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html
> >>>>>
> >>>>> It sounded like Michael was ok with working with/around heap
> >>>>> initialization.
> >>>>>
> >>>>> Michael, is that right or should we still consider this on the table?
> >>>>>
> >>>>>
> >>>>> Val
> >>>>>
> >>>>>
> >>>>> On 08/26/2014 09:34 AM, Gabe Becker wrote:
> >>>>>
> >>>>>  Val,
> >>>>>>
> >>>>>> Has there been any movement on this? This remains a substantial
> >>>>>> bottleneck for us when writing very large VCF files (e.g.
> >>>>>> variants+genotypes for whole genome NGS samples).
> >>>>>>
> >>>>>> I was able to see a ~25% speedup with 4 cores and  an "optimal"
> >>>>>> speedup
> >>>>>> of ~2x with 10-12 cores for a VCF with 500k rows  using a very naive
> >>>>>> parallelization strategy and no other changes. I suspect this could
> be
> >>>>>> improved on quite a bit, or possibly made irrelevant with judicious
> >>>>>> use
> >>>>>> of serial C code.
> >>>>>>
> >>>>>> Did you and Martin make any plans regarding optimizing writeVcf?
> >>>>>>
> >>>>>> Best
> >>>>>> ~G
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain <
> vobencha at fhcrc.org
> >>>>>> <mailto:vobencha at fhcrc.org>> wrote:
> >>>>>>
> >>>>>>      Hi Michael,
> >>>>>>
> >>>>>>      I'm interested in working on this. I'll discuss with Martin
> next
> >>>>>>      week when we're both back in the office.
> >>>>>>
> >>>>>>      Val
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>      On 08/05/14 07:46, Michael Lawrence wrote:
> >>>>>>
> >>>>>>          Hi guys (Val, Martin, Herve):
> >>>>>>
> >>>>>>          Anyone have an itch for optimization? The writeVcf function
> >>>>>> is
> >>>>>>          currently a
> >>>>>>          bottleneck in our WGS genotyping pipeline. For a typical 50
> >>>>>>          million row
> >>>>>>          gVCF, it was taking 2.25 hours prior to yesterday's
> >>>>>> improvements
> >>>>>>          (pasteCollapseRows) that brought it down to about 1 hour,
> >>>>>> which
> >>>>>>          is still
> >>>>>>          too long by my standards (> 0). Only takes 3 minutes to
> call
> >>>>>> the
> >>>>>>          genotypes
> >>>>>>          (and associated likelihoods etc) from the variant calls
> >>>>>> (using
> >>>>>>          80 cores and
> >>>>>>          450 GB RAM on one node), so the output is an issue.
> Profiling
> >>>>>>          suggests that
> >>>>>>          the running time scales non-linearly in the number of rows.
> >>>>>>
> >>>>>>          Digging a little deeper, it seems to be something with R's
> >>>>>>          string/memory
> >>>>>>          allocation. Below, pasting 1 million strings takes 6
> >>>>>> seconds, but
> >>>>>> 10
> >>>>>>          million strings takes over 2 minutes. It gets way worse
> with
> >>>>>> 50
> >>>>>>          million. I
> >>>>>>          suspect it has something to do with R's string hash table.
> >>>>>>
> >>>>>>          set.seed(1000)
> >>>>>>          end <- sample(1e8, 1e6)
> >>>>>>          system.time(paste0("END", "=", end))
> >>>>>>               user  system elapsed
> >>>>>>              6.396   0.028   6.420
> >>>>>>
> >>>>>>          end <- sample(1e8, 1e7)
> >>>>>>          system.time(paste0("END", "=", end))
> >>>>>>               user  system elapsed
> >>>>>>          134.714   0.352 134.978
> >>>>>>
> >>>>>>          Indeed, even this takes a long time (in a fresh session):
> >>>>>>
> >>>>>>          set.seed(1000)
> >>>>>>          end <- sample(1e8, 1e6)
> >>>>>>          end <- sample(1e8, 1e7)
> >>>>>>          system.time(as.character(end))
> >>>>>>               user  system elapsed
> >>>>>>             57.224   0.156  57.366
> >>>>>>
> >>>>>>          But running it a second time is faster (about what one
> would
> >>>>>>          expect?):
> >>>>>>
> >>>>>>          system.time(levels <- as.character(end))
> >>>>>>               user  system elapsed
> >>>>>>             23.582   0.021  23.589
> >>>>>>
> >>>>>>          I did some simple profiling of R to find that the resizing
> of
> >>>>>>          the string
> >>>>>>          hash table is not a significant component of the time. So
> >>>>>> maybe
> >>>>>>          something
> >>>>>>          to do with the R heap/gc? No time right now to go deeper.
> >>>>>> But I
> >>>>>>          know Martin
> >>>>>>          likes this sort of thing ;)
> >>>>>>
> >>>>>>          Michael
> >>>>>>
> >>>>>>                   [[alternative HTML version deleted]]
> >>>>>>
> >>>>>>          _________________________________________________
> >>>>>>          Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
> >>>>>>          mailing list
> >>>>>>          https://stat.ethz.ch/mailman/__listinfo/bioc-devel
> >>>>>>          <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
> >>>>>>
> >>>>>>
> >>>>>>      _________________________________________________
> >>>>>>      Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
> >>>>>> mailing
> >>>>>> list
> >>>>>>      https://stat.ethz.ch/mailman/__listinfo/bioc-devel
> >>>>>>
> >>>>>>      <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Computational Biologist
> >>>>>> Genentech Research
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>         [[alternative HTML version deleted]]
> >>>>
> >>>> _______________________________________________
> >>>> Bioc-devel at r-project.org mailing list
> >>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> >>>>
> >>>>
> >>>
> >>> --
> >>> Computational Biology / Fred Hutchinson Cancer Research Center
> >>> 1100 Fairview Ave. N.
> >>> PO Box 19024 Seattle, WA 98109
> >>>
> >>> Location: Arnold Building M1 B861
> >>> Phone: (206) 667-2793
> >>>
> >>
> >>
> >
> >
> > --
> > Computational Biologist
> > Genentech Research
> >
>
>
>
> --
> Computational Biologist
> Genentech Research
>
>         [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

	[[alternative HTML version deleted]]



More information about the Bioc-devel mailing list