[Bioc-devel] writeVcf performance
Martin Morgan
mtmorgan at fhcrc.org
Tue Aug 26 20:15:06 CEST 2014
I didn't see a reproducible (simulated, I guess) example in the original
thread. Could someone post one, to be explicit about what the problem is?
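
As a starting point, a simulated example might look like the sketch below. It is built only from the paste0() timings quoted further down in this thread, not from writeVcf itself; the helper name time_paste and the sizes are assumptions chosen for illustration.

```r
# Simulated stand-in for the string-building step; not writeVcf itself.
# time_paste is a hypothetical helper name; sizes are illustrative.
time_paste <- function(sizes = c(1e5, 1e6, 1e7), seed = 1000) {
  set.seed(seed)
  rows <- lapply(sizes, function(n) {
    end <- sample(1e8, n)  # n distinct integers, as in the quoted example
    elapsed <- system.time(paste0("END", "=", end))[["elapsed"]]
    data.frame(n = n, elapsed = elapsed,
               sec_per_million = elapsed / (n / 1e6))
  })
  do.call(rbind, rows)
}

# Small sizes keep this quick; raise them toward 1e7 to probe the scaling.
res <- time_paste(sizes = c(1e4, 1e5))
print(res)
```

If sec_per_million grows as n grows, the running time is scaling worse than linearly in the number of strings, which is the behaviour reported below.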
Martin
On 08/26/2014 10:47 AM, Michael Lawrence wrote:
> My understanding is that the heap optimization provided marginal gains, and
> that we need to think harder about how to optimize all of the string
> manipulation in writeVcf. We either need to reduce it or reduce its
> overhead (i.e., the CHARSXP allocation). Gabe is doing more tests.
>
>
> On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain <vobencha at fhcrc.org>
> wrote:
>
>> Hi Gabe,
>>
>> Martin responded, and so did Michael,
>>
>> https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html
>>
>> It sounded like Michael was ok with working with/around heap
>> initialization.
>>
>> Michael, is that right or should we still consider this on the table?
>>
>>
>> Val
>>
>>
>> On 08/26/2014 09:34 AM, Gabe Becker wrote:
>>
>>> Val,
>>>
>>> Has there been any movement on this? This remains a substantial
>>> bottleneck for us when writing very large VCF files (e.g.
>>> variants+genotypes for whole genome NGS samples).
>>>
>>> I was able to see a ~25% speedup with 4 cores and an "optimal" speedup
>>> of ~2x with 10-12 cores for a VCF with 500k rows using a very naive
>>> parallelization strategy and no other changes. I suspect this could be
>>> improved on quite a bit, or possibly made irrelevant with judicious use
>>> of serial C code.
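
For concreteness, a naive chunked parallelization of the string formatting could be sketched as follows. This is not Gabe's actual code, which is not shown in the thread; format_chunk and paste_parallel are hypothetical names, and the paste0() call stands in for whatever per-row formatting writeVcf does.

```r
library(parallel)

# Hypothetical helper: format one block of values as "END=<value>" strings.
format_chunk <- function(idx, end) paste0("END", "=", end[idx])

# Hypothetical wrapper: split the rows into one block per core, format each
# block in a forked worker, and concatenate the results in order.
paste_parallel <- function(end, cores = 2L) {
  chunks <- splitIndices(length(end), cores)  # contiguous index blocks
  # mclapply forks; on Windows only mc.cores = 1 is supported.
  pieces <- mclapply(chunks, format_chunk, end = end, mc.cores = cores)
  unlist(pieces, use.names = FALSE)
}

set.seed(1000)
end <- sample(1e8, 1e5)
cores <- if (.Platform$OS.type == "windows") 1L else 2L
out <- paste_parallel(end, cores = cores)
```

Each worker has to serialize its strings back to the parent process, so the speedup from a scheme like this is likely to be well short of linear in the number of cores.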
>>>
>>> Did you and Martin make any plans regarding optimizing writeVcf?
>>>
>>> Best
>>> ~G
>>>
>>>
>>> On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain <vobencha at fhcrc.org> wrote:
>>>
>>> Hi Michael,
>>>
>>> I'm interested in working on this. I'll discuss with Martin next
>>> week when we're both back in the office.
>>>
>>> Val
>>>
>>>
>>>
>>>
>>>
>>> On 08/05/14 07:46, Michael Lawrence wrote:
>>>
>>> Hi guys (Val, Martin, Herve):
>>>
>>> Anyone have an itch for optimization? The writeVcf function is currently a
>>> bottleneck in our WGS genotyping pipeline. For a typical 50 million row
>>> gVCF, it was taking 2.25 hours prior to yesterday's improvements
>>> (pasteCollapseRows) that brought it down to about 1 hour, which is still
>>> too long by my standards (> 0). Only takes 3 minutes to call the genotypes
>>> (and associated likelihoods etc) from the variant calls (using 80 cores and
>>> 450 GB RAM on one node), so the output is an issue. Profiling suggests that
>>> the running time scales non-linearly in the number of rows.
>>>
>>> Digging a little deeper, it seems to be something with R's string/memory
>>> allocation. Below, pasting 1 million strings takes 6 seconds, but 10
>>> million strings takes over 2 minutes. It gets way worse with 50 million. I
>>> suspect it has something to do with R's string hash table.
>>>
>>> set.seed(1000)
>>> end <- sample(1e8, 1e6)
>>> system.time(paste0("END", "=", end))
>>>    user  system elapsed
>>>   6.396   0.028   6.420
>>>
>>> end <- sample(1e8, 1e7)
>>> system.time(paste0("END", "=", end))
>>>    user  system elapsed
>>> 134.714   0.352 134.978
>>>
>>> Indeed, even this takes a long time (in a fresh session):
>>>
>>> set.seed(1000)
>>> end <- sample(1e8, 1e6)
>>> end <- sample(1e8, 1e7)
>>> system.time(as.character(end))
>>>    user  system elapsed
>>>  57.224   0.156  57.366
>>>
>>> But running it a second time is faster (about what one would expect?):
>>>
>>> system.time(levels <- as.character(end))
>>>    user  system elapsed
>>>  23.582   0.021  23.589
>>>
>>> I did some simple profiling of R to find that the resizing of the string
>>> hash table is not a significant component of the time. So maybe something
>>> to do with the R heap/gc? No time right now to go deeper. But I know Martin
>>> likes this sort of thing ;)
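
One cheap way to test the heap/gc hypothesis is to compare the total time of the conversion with the time R reports spending in the garbage collector over the same call. The sketch below uses base R's gc.time(), which returns cumulative GC timings; the vector size is deliberately small and is an assumption for illustration, so it would need to be raised toward 1e7 to match the timings above.

```r
set.seed(1000)
end <- sample(1e8, 1e5)  # small size so the sketch runs quickly

invisible(gc())          # start from a freshly collected heap
gc_before <- gc.time()   # cumulative GC time so far (user, system, elapsed, ...)

# gcFirst = FALSE so that system.time() does not trigger an extra collection
# that would be counted in the difference below.
total <- system.time(chr <- as.character(end), gcFirst = FALSE)

gc_spent <- gc.time() - gc_before  # GC time accrued during the conversion

cat("elapsed:", total[["elapsed"]],
    " of which GC (elapsed):", gc_spent[3], "\n")
```

If the GC share is close to the total at large sizes, that would point at the collector rather than the string hash table.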
>>>
>>> Michael
>>>
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Computational Biologist
>>> Genentech Research
>>>
>>
>>
>>
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793