[Bioc-devel] writeVcf performance

Martin Morgan mtmorgan at fhcrc.org
Tue Aug 26 20:15:06 CEST 2014


I didn't see a reproducible (simulated, I guess) example in the original
thread. Could someone post one, to be explicit about what the problem is?
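
Something along these lines is what I have in mind. It is only a minimal sketch, assuming the paste0() timing from Michael's original post is representative of the writeVcf problem. The helper name time_paste and the vector sizes are placeholders for illustration, not anything taken from VariantAnnotation.

```r
## Hypothetical helper: elapsed seconds to paste n simulated END values.
time_paste <- function(n) {
    end <- sample(1e8, n)   # simulated integer positions, as in the original post
    system.time(paste0("END", "=", end))[["elapsed"]]
}

set.seed(1000)
sizes <- c(1e5, 1e6)        # kept small here; scale up to 1e7 to see the slowdown
timings <- data.frame(
    n = sizes,
    elapsed = vapply(sizes, time_paste, numeric(1))
)
## Microseconds per element; roughly constant if the cost is linear in n.
timings$per_element_us <- 1e6 * timings$elapsed / timings$n
print(timings)
```

If the time per element grows with n, the scaling is non-linear, which is the behavior reported in the original post.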

Martin

On 08/26/2014 10:47 AM, Michael Lawrence wrote:
> My understanding is that the heap optimization provided marginal gains, and
> that we need to think harder about how to optimize all of the string
> manipulation in writeVcf. We either need to reduce it or reduce its
> overhead (i.e., the CHARSXP allocation). Gabe is doing more tests.
>
>
> On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain <vobencha at fhcrc.org>
> wrote:
>
>> Hi Gabe,
>>
>> Martin responded, and so did Michael,
>>
>> https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html
>>
>> It sounded like Michael was ok with working with/around heap
>> initialization.
>>
>> Michael, is that right or should we still consider this on the table?
>>
>>
>> Val
>>
>>
>> On 08/26/2014 09:34 AM, Gabe Becker wrote:
>>
>>> Val,
>>>
>>> Has there been any movement on this? This remains a substantial
>>> bottleneck for us when writing very large VCF files (e.g.
>>> variants+genotypes for whole genome NGS samples).
>>>
>>> I was able to see a ~25% speedup with 4 cores and an "optimal" speedup
>>> of ~2x with 10-12 cores for a VCF with 500k rows using a very naive
>>> parallelization strategy and no other changes. I suspect this could be
>>> improved on quite a bit, or possibly made irrelevant with judicious use
>>> of serial C code.
>>>
>>> Did you and Martin make any plans regarding optimizing writeVcf?
>>>
>>> Best
>>> ~G
>>>
>>>
>>> On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain <vobencha at fhcrc.org> wrote:
>>>
>>>      Hi Michael,
>>>
>>>      I'm interested in working on this. I'll discuss with Martin next
>>>      week when we're both back in the office.
>>>
>>>      Val
>>>
>>>
>>>
>>>
>>>
>>>      On 08/05/14 07:46, Michael Lawrence wrote:
>>>
>>>          Hi guys (Val, Martin, Herve):
>>>
>>>          Anyone have an itch for optimization? The writeVcf function is
>>>          currently a bottleneck in our WGS genotyping pipeline. For a
>>>          typical 50 million row gVCF, it was taking 2.25 hours prior to
>>>          yesterday's improvements (pasteCollapseRows) that brought it down
>>>          to about 1 hour, which is still too long by my standards (> 0).
>>>          Only takes 3 minutes to call the genotypes (and associated
>>>          likelihoods etc) from the variant calls (using 80 cores and
>>>          450 GB RAM on one node), so the output is an issue. Profiling
>>>          suggests that the running time scales non-linearly in the number
>>>          of rows.
>>>
>>>          Digging a little deeper, it seems to be something with R's
>>>          string/memory allocation. Below, pasting 1 million strings takes
>>>          6 seconds, but 10 million strings takes over 2 minutes. It gets
>>>          way worse with 50 million. I suspect it has something to do with
>>>          R's string hash table.
>>>
>>>          set.seed(1000)
>>>          end <- sample(1e8, 1e6)
>>>          system.time(paste0("END", "=", end))
>>>               user  system elapsed
>>>              6.396   0.028   6.420
>>>
>>>          end <- sample(1e8, 1e7)
>>>          system.time(paste0("END", "=", end))
>>>               user  system elapsed
>>>          134.714   0.352 134.978
>>>
>>>          Indeed, even this takes a long time (in a fresh session):
>>>
>>>          set.seed(1000)
>>>          end <- sample(1e8, 1e6)
>>>          end <- sample(1e8, 1e7)
>>>          system.time(as.character(end))
>>>               user  system elapsed
>>>             57.224   0.156  57.366
>>>
>>>          But running it a second time is faster (about what one would
>>>          expect?):
>>>
>>>          system.time(levels <- as.character(end))
>>>               user  system elapsed
>>>             23.582   0.021  23.589
>>>
>>>          I did some simple profiling of R to find that the resizing of the
>>>          string hash table is not a significant component of the time. So
>>>          maybe something to do with the R heap/gc? No time right now to go
>>>          deeper. But I know Martin likes this sort of thing ;)
>>>
>>>          Michael
>>>
>>>
>>>          _______________________________________________
>>>          Bioc-devel at r-project.org mailing list
>>>          https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Computational Biologist
>>> Genentech Research
>>>
>>
>>
>>
>
>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


