[Bioc-devel] writeVcf performance

Wed Aug 27 20:56:54 CEST 2014

The profiling I attached in my previous email is for 24 geno fields, as I
said, but our typical use case involves only ~4-6 fields. That is faster, but
still on the order of dozens of minutes.

Sorry for the confusion.
~G

On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker <beckerg4 at gene.com> wrote:

> Martin and Val,
>
> I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields)
> with profiling enabled. The results of summaryRprof for that run are
> attached, though for a variety of reasons they are pretty misleading.
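
For anyone wanting to collect a comparable profile, here is a minimal base R
sketch. It assumes a stand-in workload (`workload()` and `profile_call()` are
hypothetical helpers, not part of VariantAnnotation); the run described above
profiled writeVcf() itself. Note that Rprof samples at the level of R
functions, so time spent in C code or in garbage collection is charged to
whichever R call was active, which is one common way such summaries mislead.

```r
# Hypothetical stand-in workload; the run in this thread profiled writeVcf().
workload <- function(n = 1e6) {
  ids <- sample(1e8, n)
  paste0("END", "=", ids)
}

# Run `fun` under the sampling profiler and return its value plus the summary.
profile_call <- function(fun, interval = 0.01) {
  prof_file <- tempfile(fileext = ".out")
  # Always stop the profiler and remove the temp file, even on error.
  on.exit({
    Rprof(NULL)
    unlink(prof_file)
  })
  Rprof(prof_file, interval = interval)
  result <- fun()
  Rprof(NULL)  # stop sampling before summarizing
  list(result = result, profile = summaryRprof(prof_file))
}

prof <- profile_call(workload)
head(prof$profile$by.self)
```

The `by.self` table lists time spent in each function itself, while
`by.total` includes time spent in its callees.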
>
> It took over an hour to write (3700+ seconds), so it's definitely a
> bottleneck when the data get very large, even if it isn't for smaller data.
>
> Michael and I both think the culprit is all the pasting and cbinding that
> is going on, and more to the point, that memory for an internal
> representation to be written out is allocated at all.  Streaming across the
> object, looping by rows and writing directly to file (e.g. from C) should
> be blisteringly fast in comparison.
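
The streaming idea above can be approximated in plain R before dropping to C.
The sketch below is an assumption-laden illustration, not the writeVcf
implementation: `write_in_chunks()` is a hypothetical helper that formats and
writes a data.frame a block of rows at a time, so only one block of formatted
strings is alive at any moment.

```r
# Hypothetical helper: write `df` to `path` as tab-separated lines, one
# block of rows at a time, instead of building all output strings up front.
write_in_chunks <- function(df, path, chunk_size = 100000L) {
  n <- nrow(df)
  con <- file(path, open = "wt")
  on.exit(close(con))
  if (n == 0L) return(invisible(0L))
  for (s in seq(1L, n, by = chunk_size)) {
    e <- min(s + chunk_size - 1L, n)
    chunk <- df[s:e, , drop = FALSE]
    # Paste the columns of this block together row-wise; unname() keeps a
    # column called "sep" or "collapse" from being read as a paste() argument.
    lines <- do.call(paste, c(unname(as.list(chunk)), sep = "\t"))
    writeLines(lines, con)
  }
  invisible(n)
}

# Toy input; integer positions avoid scientific notation in the output.
df <- data.frame(chrom = c("chr1", "chr1", "chr2"),
                 pos   = c(10L, 20L, 30L),
                 ref   = c("A", "C", "G"),
                 stringsAsFactors = FALSE)
out_path <- tempfile(fileext = ".tsv")
write_in_chunks(df, out_path, chunk_size = 2L)
readLines(out_path)
```

This still allocates strings for each block, so it bounds peak memory rather
than removing the allocation cost; a C-level writer would avoid that too.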
>
> ~G
>
>
> On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence <michafla at gene.com>
> wrote:
>
>> Gabe is still testing/profiling, but we'll send something randomized
>> along eventually.
>>
>>
>> On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan <mtmorgan at fhcrc.org>
>> wrote:
>>
>>> I didn't see in the original thread a reproducible (simulated, I guess)
>>> example, to be explicit about what the problem is??
>>>
>>> Martin
>>>
>>>
>>> On 08/26/2014 10:47 AM, Michael Lawrence wrote:
>>>
>>>> My understanding is that the heap optimization provided marginal gains, and
>>>> that we need to think harder about how to optimize all of the string
>>>> manipulation in writeVcf. We either need to reduce it or reduce its
>>>> overhead (i.e., the CHARSXP allocation). Gabe is doing more tests.
>>>>
>>>>
>>>> On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain <vobencha at fhcrc.org>
>>>> wrote:
>>>>
>>>>> Hi Gabe,
>>>>>
>>>>> Martin responded, and so did Michael,
>>>>>
>>>>> https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html
>>>>>
>>>>> It sounded like Michael was ok with working with/around heap
>>>>> initialization.
>>>>>
>>>>> Michael, is that right or should we still consider this on the table?
>>>>>
>>>>>
>>>>> Val
>>>>>
>>>>>
>>>>> On 08/26/2014 09:34 AM, Gabe Becker wrote:
>>>>>
>>>>>> Val,
>>>>>>
>>>>>> Has there been any movement on this? This remains a substantial
>>>>>> bottleneck for us when writing very large VCF files (e.g.
>>>>>> variants+genotypes for whole genome NGS samples).
>>>>>>
>>>>>> I was able to see a ~25% speedup with 4 cores and an "optimal" speedup
>>>>>> of ~2x with 10-12 cores for a VCF with 500k rows using a very naive
>>>>>> parallelization strategy and no other changes. I suspect this could be
>>>>>> improved on quite a bit, or possibly made irrelevant with judicious use
>>>>>> of serial C code.
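
The thread does not say what the naive parallelization looked like. One
plausible shape, offered purely as an assumption, is to split the rows into
contiguous blocks, format each block in a forked worker, and concatenate the
results in order. `format_rows_parallel()` below is a hypothetical helper
built on parallel::mclapply(), which only forks on Unix-alikes (on Windows,
`cores` must stay at 1).

```r
library(parallel)

# Hypothetical helper: format rows of `df` as tab-separated strings, with
# contiguous row blocks handled by separate workers.
format_rows_parallel <- function(df, n_chunks = 4L, cores = 1L) {
  n <- nrow(df)
  if (n == 0L) return(character(0))
  n_chunks <- min(n_chunks, n)
  # Contiguous block id (1..n_chunks) for every row.
  block <- ceiling(seq_len(n) * n_chunks / n)
  idx <- split(seq_len(n), block)
  pieces <- mclapply(idx, function(i) {
    do.call(paste, c(unname(as.list(df[i, , drop = FALSE])), sep = "\t"))
  }, mc.cores = cores)
  # Blocks come back in block order, so row order is preserved.
  unlist(pieces, use.names = FALSE)
}

df <- data.frame(chrom = c("chr1", "chr1", "chr2"),
                 pos   = c(10L, 20L, 30L),
                 ref   = c("A", "C", "G"),
                 stringsAsFactors = FALSE)
format_rows_parallel(df, n_chunks = 2L, cores = 1L)
```

Each worker still pays the string allocation cost, and the formatted blocks
must be copied back to the parent process.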
>>>>>>
>>>>>> Did you and Martin make any plans regarding optimizing writeVcf?
>>>>>>
>>>>>> Best
>>>>>> ~G
>>>>>>
>>>>>>
>>>>>> On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain <vobencha at fhcrc.org>
>>>>>> wrote:
>>>>>>
>>>>>>      Hi Michael,
>>>>>>
>>>>>>      I'm interested in working on this. I'll discuss with Martin next
>>>>>>      week when we're both back in the office.
>>>>>>
>>>>>>      Val
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>      On 08/05/14 07:46, Michael Lawrence wrote:
>>>>>>
>>>>>>          Hi guys (Val, Martin, Herve):
>>>>>>
>>>>>>          Anyone have an itch for optimization? The writeVcf function is
>>>>>>          currently a bottleneck in our WGS genotyping pipeline. For a
>>>>>>          typical 50 million row gVCF, it was taking 2.25 hours prior to
>>>>>>          yesterday's improvements (pasteCollapseRows) that brought it
>>>>>>          down to about 1 hour, which is still too long by my standards
>>>>>>          (> 0). Only takes 3 minutes to call the genotypes (and
>>>>>>          associated likelihoods etc) from the variant calls (using 80
>>>>>>          cores and 450 GB RAM on one node), so the output is an issue.
>>>>>>          Profiling suggests that the running time scales non-linearly
>>>>>>          in the number of rows.
>>>>>>
>>>>>>          Digging a little deeper, it seems to be something with R's
>>>>>>          string/memory allocation. Below, pasting 1 million strings
>>>>>>          takes 6 seconds, but 10 million strings takes over 2 minutes.
>>>>>>          It gets way worse with 50 million. I suspect it has something
>>>>>>          to do with R's string hash table.
>>>>>>
>>>>>>          set.seed(1000)
>>>>>>          end <- sample(1e8, 1e6)
>>>>>>          system.time(paste0("END", "=", end))
>>>>>>               user  system elapsed
>>>>>>              6.396   0.028   6.420
>>>>>>
>>>>>>          end <- sample(1e8, 1e7)
>>>>>>          system.time(paste0("END", "=", end))
>>>>>>               user  system elapsed
>>>>>>          134.714   0.352 134.978
>>>>>>
>>>>>>          Indeed, even this takes a long time (in a fresh session):
>>>>>>
>>>>>>          set.seed(1000)
>>>>>>          end <- sample(1e8, 1e6)
>>>>>>          end <- sample(1e8, 1e7)
>>>>>>          system.time(as.character(end))
>>>>>>               user  system elapsed
>>>>>>             57.224   0.156  57.366
>>>>>>
>>>>>>          But running it a second time is faster (about what one would
>>>>>>          expect?):
>>>>>>
>>>>>>          system.time(levels <- as.character(end))
>>>>>>               user  system elapsed
>>>>>>             23.582   0.021  23.589
>>>>>>
>>>>>>          I did some simple profiling of R to find that the resizing of
>>>>>>          the string hash table is not a significant component of the
>>>>>>          time. So maybe something to do with the R heap/gc? No time
>>>>>>          right now to go deeper. But I know Martin likes this sort of
>>>>>>          thing ;)
>>>>>>
>>>>>>          Michael
>>>>>>
>>>>>>                   [[alternative HTML version deleted]]
>>>>>>
>>>>>>          _________________________________________________
>>>>>>          Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>>>>>          mailing list
>>>>>>          https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>>>>>          <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Computational Biologist
>>>>>> Genentech Research
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>> --
>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N.
>>> PO Box 19024 Seattle, WA 98109
>>>
>>> Location: Arnold Building M1 B861
>>> Phone: (206) 667-2793
>>>
>>
>>
>
>
> --
> Computational Biologist
> Genentech Research
>

-- 
Computational Biologist
Genentech Research
