[Bioc-devel] writeVcf performance

Michael Lawrence lawrence.michael at gene.com
Tue Aug 26 20:57:19 CEST 2014


Gabe is still testing/profiling, but we'll send something randomized along
eventually.
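
As a rough sketch of what a simulated example might look like (this is
not the randomized test case Gabe is preparing), the paste0() timings
from the original post can be scaled down and compared across sizes.
The helper name time_paste and the sizes used here are assumptions for
illustration only; nothing below calls writeVcf itself.

```r
# Hypothetical helper (not part of VariantAnnotation): time paste0() on
# n random integers, mirroring the snippet from the original post.
time_paste <- function(n, seed = 1000) {
  set.seed(seed)
  end <- sample(1e8, n)
  unname(system.time(paste0("END", "=", end))["elapsed"])
}

# Sizes are scaled down (assumption) so the sketch finishes quickly.
sizes <- c(1e4, 1e5, 1e6)
elapsed <- vapply(sizes, time_paste, numeric(1))

# Seconds per string; a roughly constant value would indicate linear
# scaling, a growing value the non-linear behaviour described below.
per_string <- elapsed / sizes
print(data.frame(n = sizes, elapsed = elapsed, per_string = per_string))
```

Extending sizes to 1e7 and beyond is where the original post saw the
per-string cost grow.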


On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:

> I didn't see in the original thread a reproducible (simulated, I guess)
> example, to be explicit about what the problem is??
>
> Martin
>
>
> On 08/26/2014 10:47 AM, Michael Lawrence wrote:
>
>> My understanding is that the heap optimization provided marginal gains,
>> and that we need to think harder about how to optimize all of the string
>> manipulation in writeVcf. We either need to reduce it or reduce its
>> overhead (i.e., the CHARSXP allocation). Gabe is doing more tests.
>>
>>
>> On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain <vobencha at fhcrc.org>
>> wrote:
>>
>>> Hi Gabe,
>>>
>>> Martin responded, and so did Michael,
>>>
>>> https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html
>>>
>>> It sounded like Michael was ok with working with/around heap
>>> initialization.
>>>
>>> Michael, is that right or should we still consider this on the table?
>>>
>>>
>>> Val
>>>
>>>
>>> On 08/26/2014 09:34 AM, Gabe Becker wrote:
>>>
>>>> Val,
>>>>
>>>> Has there been any movement on this? This remains a substantial
>>>> bottleneck for us when writing very large VCF files (e.g.
>>>> variants+genotypes for whole genome NGS samples).
>>>>
>>>> I was able to see a ~25% speedup with 4 cores and an "optimal" speedup
>>>> of ~2x with 10-12 cores for a VCF with 500k rows using a very naive
>>>> parallelization strategy and no other changes. I suspect this could be
>>>> improved on quite a bit, or possibly made irrelevant with judicious use
>>>> of serial C code.
>>>>
>>>> Did you and Martin make any plans regarding optimizing writeVcf?
>>>>
>>>> Best
>>>> ~G
>>>>
>>>>
>>>> On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain
>>>> <vobencha at fhcrc.org> wrote:
>>>>
>>>>      Hi Michael,
>>>>
>>>>      I'm interested in working on this. I'll discuss with Martin next
>>>>      week when we're both back in the office.
>>>>
>>>>      Val
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>      On 08/05/14 07:46, Michael Lawrence wrote:
>>>>
>>>>          Hi guys (Val, Martin, Herve):
>>>>
>>>>          Anyone have an itch for optimization? The writeVcf function is
>>>>          currently a bottleneck in our WGS genotyping pipeline. For a
>>>>          typical 50 million row gVCF, it was taking 2.25 hours prior to
>>>>          yesterday's improvements (pasteCollapseRows) that brought it
>>>>          down to about 1 hour, which is still too long by my standards
>>>>          (> 0). Only takes 3 minutes to call the genotypes (and
>>>>          associated likelihoods etc) from the variant calls (using 80
>>>>          cores and 450 GB RAM on one node), so the output is an issue.
>>>>          Profiling suggests that the running time scales non-linearly in
>>>>          the number of rows.
>>>>
>>>>          Digging a little deeper, it seems to be something with R's
>>>>          string/memory allocation. Below, pasting 1 million strings takes
>>>>          6 seconds, but 10 million strings takes over 2 minutes. It gets
>>>>          way worse with 50 million. I suspect it has something to do with
>>>>          R's string hash table.
>>>>
>>>>          set.seed(1000)
>>>>          end <- sample(1e8, 1e6)
>>>>          system.time(paste0("END", "=", end))
>>>>               user  system elapsed
>>>>              6.396   0.028   6.420
>>>>
>>>>          end <- sample(1e8, 1e7)
>>>>          system.time(paste0("END", "=", end))
>>>>               user  system elapsed
>>>>          134.714   0.352 134.978
>>>>
>>>>          Indeed, even this takes a long time (in a fresh session):
>>>>
>>>>          set.seed(1000)
>>>>          end <- sample(1e8, 1e6)
>>>>          end <- sample(1e8, 1e7)
>>>>          system.time(as.character(end))
>>>>               user  system elapsed
>>>>             57.224   0.156  57.366
>>>>
>>>>          But running it a second time is faster (about what one would
>>>>          expect?):
>>>>
>>>>          system.time(levels <- as.character(end))
>>>>               user  system elapsed
>>>>             23.582   0.021  23.589
>>>>
>>>>          I did some simple profiling of R to find that the resizing of
>>>>          the string hash table is not a significant component of the
>>>>          time. So maybe something to do with the R heap/gc? No time right
>>>>          now to go deeper. But I know Martin likes this sort of thing ;)
>>>>
>>>>          Michael
>>>>
>>>>                   [[alternative HTML version deleted]]
>>>>
>>>>          _________________________________________________
>>>>          Bioc-devel at r-project.org mailing list
>>>>          https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Computational Biologist
>>>> Genentech Research
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>
