[Bioc-devel] writeVcf performance
Gabe Becker
becker.gabe at gene.com
Wed Aug 27 20:56:54 CEST 2014
The profiling I attached in my previous email is for 24 geno fields, as I
said, but our typical use case involves only ~4-6 fields, which is faster
but still on the order of dozens of minutes.
Sorry for the confusion.
~G
On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker <beckerg4 at gene.com> wrote:
> Martin and Val.
>
> I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields)
> with profiling enabled. The results of summaryRprof for that run are
> attached, though for a variety of reasons they are pretty misleading.
>
> It took over an hour to write (3700+ seconds), so it's definitely a
> bottleneck when the data get very large, even if it isn't for smaller data.
>
> Michael and I both think the culprit is all the pasting and cbinding that
> is going on, and more to the point, that memory for an internal
> representation to be written out is allocated at all. Streaming across the
> object, looping by rows and writing directly to file (e.g. from C) should
> be blisteringly fast in comparison.
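
As a rough illustration of the streaming idea above, here is a minimal R
sketch that formats and writes rows one chunk at a time, so that only one
chunk's worth of strings is alive at once. It is not how writeVcf works, and
it is not the C implementation suggested above. `write_rows_chunked` and its
`chunk_size` argument are hypothetical names, and a plain data.frame stands
in for a VCF object.

```r
# Hypothetical helper (not part of VariantAnnotation): write the rows of a
# data.frame as tab-separated lines, formatting one chunk of rows at a time.
write_rows_chunked <- function(df, path, chunk_size = 100000L) {
  n <- nrow(df)
  con <- file(path, open = "wt")
  on.exit(close(con))
  if (n == 0L) return(invisible(0L))

  for (start in seq(1L, n, by = chunk_size)) {
    idx <- start:min(start + chunk_size - 1L, n)
    # unname() so that column names are not mistaken for paste() arguments
    cols <- unname(as.list(df[idx, , drop = FALSE]))
    # Only this chunk's strings exist before they are written out
    writeLines(do.call(paste, c(cols, sep = "\t")), con)
  }
  invisible(n)
}

# Toy usage: three rows written in chunks of two
df <- data.frame(chrom = c("chr1", "chr1", "chr2"), pos = c(10L, 20L, 30L))
out <- tempfile()
write_rows_chunked(df, out, chunk_size = 2L)
```

The chunk size trades peak memory against the number of calls to `paste` and
`writeLines`. A C-level writer would go further and avoid creating R strings
at all.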
>
> ~G
>
>
> On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence <michafla at gene.com>
> wrote:
>
>> Gabe is still testing/profiling, but we'll send something randomized
>> along eventually.
>>
>>
>> On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan <mtmorgan at fhcrc.org>
>> wrote:
>>
>>> I didn't see in the original thread a reproducible (simulated, I guess)
>>> example, to be explicit about what the problem is??
>>>
>>> Martin
>>>
>>>
>>> On 08/26/2014 10:47 AM, Michael Lawrence wrote:
>>>
>>>> My understanding is that the heap optimization provided marginal gains,
>>>> and that we need to think harder about how to optimize all of the string
>>>> manipulation in writeVcf. We either need to reduce it or reduce its
>>>> overhead (i.e., the CHARSXP allocation). Gabe is doing more tests.
>>>>
>>>>
>>>> On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain <vobencha at fhcrc.org>
>>>> wrote:
>>>>
>>>>> Hi Gabe,
>>>>>
>>>>> Martin responded, and so did Michael,
>>>>>
>>>>> https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html
>>>>>
>>>>> It sounded like Michael was ok with working with/around heap
>>>>> initialization.
>>>>>
>>>>> Michael, is that right or should we still consider this on the table?
>>>>>
>>>>>
>>>>> Val
>>>>>
>>>>>
>>>>> On 08/26/2014 09:34 AM, Gabe Becker wrote:
>>>>>
>>>>>> Val,
>>>>>>
>>>>>> Has there been any movement on this? This remains a substantial
>>>>>> bottleneck for us when writing very large VCF files (e.g.
>>>>>> variants+genotypes for whole genome NGS samples).
>>>>>>
>>>>>> I was able to see a ~25% speedup with 4 cores and an "optimal" speedup
>>>>>> of ~2x with 10-12 cores for a VCF with 500k rows using a very naive
>>>>>> parallelization strategy and no other changes. I suspect this could be
>>>>>> improved on quite a bit, or possibly made irrelevant with judicious use
>>>>>> of serial C code.
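
The message above does not say what the naive strategy was. One plausible
shape for it, sketched here purely as an assumption, is to split the row
indices into contiguous chunks, build the strings for each chunk in a
separate worker, and concatenate the pieces in order. `paste_parallel` is a
hypothetical name. `splitIndices` and `mclapply` come from the parallel
package shipped with R, and `mclapply` only forks on Unix-alikes (on Windows
`cores` must stay at 1).

```r
library(parallel)

# Hypothetical helper: build "END=<value>" strings chunk by chunk, with each
# chunk handled by mclapply (serial when cores = 1).
paste_parallel <- function(x, n_chunks = 4L, cores = 1L) {
  idx <- splitIndices(length(x), n_chunks)   # contiguous index chunks, in order
  pieces <- mclapply(idx, function(i) paste0("END=", x[i]), mc.cores = cores)
  unlist(pieces, use.names = FALSE)
}

set.seed(1000)
end <- sample(1e8, 1e4)
res <- paste_parallel(end, n_chunks = 4L, cores = 1L)
```

Each forked worker has to send its character vector back to the parent
process, which may be part of why the speedup reported above falls well short
of the number of cores.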
>>>>>>
>>>>>> Did you and Martin make any plans regarding optimizing writeVcf?
>>>>>>
>>>>>> Best
>>>>>> ~G
>>>>>>
>>>>>>
>>>>>> On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain <vobencha at fhcrc.org> wrote:
>>>>>>
>>>>>> Hi Michael,
>>>>>>
>>>>>> I'm interested in working on this. I'll discuss with Martin next
>>>>>> week when we're both back in the office.
>>>>>>
>>>>>> Val
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 08/05/14 07:46, Michael Lawrence wrote:
>>>>>>
>>>>>> Hi guys (Val, Martin, Herve):
>>>>>>
>>>>>> Anyone have an itch for optimization? The writeVcf function is
>>>>>> currently a bottleneck in our WGS genotyping pipeline. For a typical
>>>>>> 50 million row gVCF, it was taking 2.25 hours prior to yesterday's
>>>>>> improvements (pasteCollapseRows) that brought it down to about 1 hour,
>>>>>> which is still too long by my standards (> 0). Only takes 3 minutes to
>>>>>> call the genotypes (and associated likelihoods etc) from the variant
>>>>>> calls (using 80 cores and 450 GB RAM on one node), so the output is an
>>>>>> issue. Profiling suggests that the running time scales non-linearly in
>>>>>> the number of rows.
>>>>>>
>>>>>> Digging a little deeper, it seems to be something with R's
>>>>>> string/memory allocation. Below, pasting 1 million strings takes 6
>>>>>> seconds, but 10 million strings takes over 2 minutes. It gets way worse
>>>>>> with 50 million. I suspect it has something to do with R's string hash
>>>>>> table.
>>>>>>
>>>>>> set.seed(1000)
>>>>>> end <- sample(1e8, 1e6)
>>>>>> system.time(paste0("END", "=", end))
>>>>>> user system elapsed
>>>>>> 6.396 0.028 6.420
>>>>>>
>>>>>> end <- sample(1e8, 1e7)
>>>>>> system.time(paste0("END", "=", end))
>>>>>> user system elapsed
>>>>>> 134.714 0.352 134.978
>>>>>>
>>>>>> Indeed, even this takes a long time (in a fresh session):
>>>>>>
>>>>>> set.seed(1000)
>>>>>> end <- sample(1e8, 1e6)
>>>>>> end <- sample(1e8, 1e7)
>>>>>> system.time(as.character(end))
>>>>>> user system elapsed
>>>>>> 57.224 0.156 57.366
>>>>>>
>>>>>> But running it a second time is faster (about what one would
>>>>>> expect?):
>>>>>>
>>>>>> system.time(levels <- as.character(end))
>>>>>> user system elapsed
>>>>>> 23.582 0.021 23.589
>>>>>>
>>>>>> I did some simple profiling of R to find that the resizing of the
>>>>>> string hash table is not a significant component of the time. So maybe
>>>>>> something to do with the R heap/gc? No time right now to go deeper. But
>>>>>> I know Martin likes this sort of thing ;)
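
One way to probe the heap/gc question raised above is to compare the
wall-clock time of the conversion with the time R reports spending in garbage
collection over the same interval. The sketch below is an assumption about
how one might measure this, not the profiling that was actually done.
`time_with_gc` is a hypothetical helper, while `gc.time()` and
`system.time()` are base R. The sizes are kept small so that it runs quickly.

```r
# Hypothetical helper: time as.character() on n random integers and report
# how much of the interval R attributes to garbage collection.
time_with_gc <- function(n) {
  x <- sample(1e8, n)
  invisible(gc())            # collect first so earlier garbage is not counted
  gc_before <- gc.time()
  elapsed <- system.time(res <- as.character(x), gcFirst = FALSE)[["elapsed"]]
  gc_elapsed <- (gc.time() - gc_before)[[3]]  # third element: elapsed GC time
  c(n = n, elapsed = elapsed, gc_elapsed = gc_elapsed)
}

set.seed(1000)
# One row per input size; raise the sizes toward 1e6 and 1e7 to mirror the
# timings quoted above.
timings <- t(sapply(c(1e4, 1e5), time_with_gc))
```

If `gc_elapsed` grows much faster than `n`, that would point toward the
collector rather than the string hash table. Timer resolution makes the small
sizes used here noisy.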
>>>>>>
>>>>>> Michael
>>>>>>
>>>>>> _________________________________________________
>>>>>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>>>>> mailing list
>>>>>> https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>>>>> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>>>>>
>>>>>> --
>>>>>> Computational Biologist
>>>>>> Genentech Research
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>> --
>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N.
>>> PO Box 19024 Seattle, WA 98109
>>>
>>> Location: Arnold Building M1 B861
>>> Phone: (206) 667-2793
>>>
>>
>>
>
>
> --
> Computational Biologist
> Genentech Research
>
--
Computational Biologist
Genentech Research