[Bioc-devel] writeVcf performance
Michael Lawrence
lawrence.michael at gene.com
Tue Aug 26 20:57:19 CEST 2014
Gabe is still testing/profiling, but we'll send something randomized along
eventually.
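
In the meantime, here is a minimal sketch of the kind of simulated example
Martin is asking about, based on the paste0 timings quoted further down in
this thread. The helper name time_paste and the sizes used are assumptions
for illustration only; this is not the benchmark Gabe is running.

```r
# Hypothetical helper (not part of VariantAnnotation): time paste0() on
# n random integers, mirroring the "END=<pos>" strings from the thread.
time_paste <- function(n, seed = 1000) {
  set.seed(seed)
  end <- sample(1e8, n)          # n random positions in 1..1e8
  gc()                           # start each measurement from a collected heap
  timing <- system.time(res <- paste0("END", "=", end))
  stopifnot(length(res) == n)    # sanity check: one string per input
  unname(timing["elapsed"])
}

# Sizes are kept small here so the sketch runs quickly; raise them toward
# 1e6 and 1e7 to probe the scaling discussed in the thread.
sizes <- c(1e4, 1e5)
elapsed <- vapply(sizes, time_paste, numeric(1))

# If the cost were linear in n, sec_per_million would stay roughly flat.
result <- data.frame(n = sizes,
                     elapsed = elapsed,
                     sec_per_million = elapsed / (sizes / 1e6))
print(result)
```

Running each size in a fresh R session would likely give cleaner numbers,
since the timings quoted below differ between first and second runs.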
On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> I didn't see in the original thread a reproducible (simulated, I guess)
> example, to be explicit about what the problem is??
>
> Martin
>
>
> On 08/26/2014 10:47 AM, Michael Lawrence wrote:
>
>> My understanding is that the heap optimization provided marginal gains, and
>> that we need to think harder about how to optimize all of the string
>> manipulation in writeVcf. We either need to reduce it or reduce its
>> overhead (i.e., the CHARSXP allocation). Gabe is doing more tests.
>>
>>
>> On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain <vobencha at fhcrc.org>
>> wrote:
>>
>>> Hi Gabe,
>>>
>>> Martin responded, and so did Michael,
>>>
>>> https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html
>>>
>>> It sounded like Michael was ok with working with/around heap
>>> initialization.
>>>
>>> Michael, is that right or should we still consider this on the table?
>>>
>>>
>>> Val
>>>
>>>
>>> On 08/26/2014 09:34 AM, Gabe Becker wrote:
>>>
>>>> Val,
>>>>
>>>> Has there been any movement on this? This remains a substantial
>>>> bottleneck for us when writing very large VCF files (e.g.
>>>> variants+genotypes for whole genome NGS samples).
>>>>
>>>> I was able to see a ~25% speedup with 4 cores and an "optimal" speedup
>>>> of ~2x with 10-12 cores for a VCF with 500k rows using a very naive
>>>> parallelization strategy and no other changes. I suspect this could be
>>>> improved on quite a bit, or possibly made irrelevant with judicious use
>>>> of serial C code.
>>>>
>>>> Did you and Martin make any plans regarding optimizing writeVcf?
>>>>
>>>> Best
>>>> ~G
>>>>
>>>>
>>>> On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain <vobencha at fhcrc.org>
>>>> wrote:
>>>>
>>>> Hi Michael,
>>>>
>>>> I'm interested in working on this. I'll discuss with Martin next
>>>> week when we're both back in the office.
>>>>
>>>> Val
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 08/05/14 07:46, Michael Lawrence wrote:
>>>>
>>>> Hi guys (Val, Martin, Herve):
>>>>
>>>> Anyone have an itch for optimization? The writeVcf function is currently
>>>> a bottleneck in our WGS genotyping pipeline. For a typical 50 million row
>>>> gVCF, it was taking 2.25 hours prior to yesterday's improvements
>>>> (pasteCollapseRows) that brought it down to about 1 hour, which is still
>>>> too long by my standards (> 0). Only takes 3 minutes to call the
>>>> genotypes (and associated likelihoods etc) from the variant calls (using
>>>> 80 cores and 450 GB RAM on one node), so the output is an issue.
>>>> Profiling suggests that the running time scales non-linearly in the
>>>> number of rows.
>>>>
>>>> Digging a little deeper, it seems to be something with R's string/memory
>>>> allocation. Below, pasting 1 million strings takes 6 seconds, but 10
>>>> million strings takes over 2 minutes. It gets way worse with 50 million.
>>>> I suspect it has something to do with R's string hash table.
>>>>
>>>> set.seed(1000)
>>>> end <- sample(1e8, 1e6)
>>>> system.time(paste0("END", "=", end))
>>>>    user  system elapsed
>>>>   6.396   0.028   6.420
>>>>
>>>> end <- sample(1e8, 1e7)
>>>> system.time(paste0("END", "=", end))
>>>>    user  system elapsed
>>>> 134.714   0.352 134.978
>>>>
>>>> Indeed, even this takes a long time (in a fresh session):
>>>>
>>>> set.seed(1000)
>>>> end <- sample(1e8, 1e6)
>>>> end <- sample(1e8, 1e7)
>>>> system.time(as.character(end))
>>>>    user  system elapsed
>>>>  57.224   0.156  57.366
>>>>
>>>> But running it a second time is faster (about what one would
>>>> expect?):
>>>>
>>>> system.time(levels <- as.character(end))
>>>>    user  system elapsed
>>>>  23.582   0.021  23.589
>>>>
>>>> I did some simple profiling of R to find that the resizing of the string
>>>> hash table is not a significant component of the time. So maybe something
>>>> to do with the R heap/gc? No time right now to go deeper. But I know
>>>> Martin likes this sort of thing ;)
>>>>
>>>> Michael
>>>>
>>>> [[alternative HTML version deleted]]
>>>>
>>>> _______________________________________________
>>>> Bioc-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>
>>>>
>>>> --
>>>> Computational Biologist
>>>> Genentech Research
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>