[Bioc-devel] writeVcf performance

Tue Aug 26 18:34:35 CEST 2014

Val,

Has there been any movement on this? This remains a substantial bottleneck
for us when writing very large VCF files (e.g. variants+genotypes for whole
genome NGS samples).

I was able to see a ~25% speedup with 4 cores and an "optimal" speedup of
~2x with 10-12 cores for a VCF with 500k rows using a very naive
parallelization strategy and no other changes. I suspect this could be
improved on quite a bit, or possibly made irrelevant with judicious use of
serial C code.
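
A naive strategy of this kind could be sketched as follows. This is an
illustration only, not the code behind the numbers above: format_rows() and
write_lines_parallel() are hypothetical names, the row formatting is a
simple stand-in for what writeVcf actually does, and the input is assumed to
be a plain data.frame. Rows are split into contiguous chunks, each chunk is
pasted into text lines in a forked worker, and the lines are written out in
the original order.

```r
library(parallel)

# Hypothetical stand-in for the per-row formatting step: builds one
# tab-separated text line per row of a data.frame.
format_rows <- function(df) {
  do.call(paste, c(unname(as.list(df)), sep = "\t"))
}

# Split rows into contiguous chunks, format each chunk in a worker,
# and write the resulting lines to `con` in their original order.
write_lines_parallel <- function(df, con, cores = 4L) {
  n <- nrow(df)
  if (n == 0L) return(invisible(0L))
  nchunks <- min(cores, n)
  # Map row i to a chunk id in 1..nchunks; ids are non-decreasing in i,
  # so every chunk is a contiguous block of rows.
  grp <- ceiling(seq_len(n) * nchunks / n)
  idx <- split(seq_len(n), grp)
  # mclapply() relies on forking, which is unavailable on Windows,
  # so fall back to a single core there.
  use_cores <- if (.Platform$OS.type == "windows") 1L else cores
  chunks <- mclapply(idx, function(i) format_rows(df[i, , drop = FALSE]),
                     mc.cores = use_cores)
  writeLines(unlist(chunks, use.names = FALSE), con)
  invisible(n)
}

# Example usage on a toy table
df <- data.frame(chrom = "chr1", pos = 1:10, ref = "A", alt = "T")
out <- tempfile()
write_lines_parallel(df, out, cores = 2L)
```

Only the string formatting is parallelized here; the final write is still
serial, and the formatted chunks have to be copied back from the workers,
which may be part of why the speedup flattens out well below the core count.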

Did you and Martin make any plans regarding optimizing writeVcf?

Best
~G

On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain <vobencha at fhcrc.org>
wrote:

> Hi Michael,
>
> I'm interested in working on this. I'll discuss with Martin next week when
> we're both back in the office.
>
> Val
>
> On 08/05/14 07:46, Michael Lawrence wrote:
>
>> Hi guys (Val, Martin, Herve):
>>
>> Anyone have an itch for optimization? The writeVcf function is currently a
>> bottleneck in our WGS genotyping pipeline. For a typical 50 million row
>> gVCF, it was taking 2.25 hours prior to yesterday's improvements
>> (pasteCollapseRows) that brought it down to about 1 hour, which is still
>> too long by my standards (> 0). Only takes 3 minutes to call the genotypes
>> (and associated likelihoods etc) from the variant calls (using 80 cores and
>> 450 GB RAM on one node), so the output is an issue. Profiling suggests that
>> the running time scales non-linearly in the number of rows.
>>
>> Digging a little deeper, it seems to be something with R's string/memory
>> allocation. Below, pasting 1 million strings takes 6 seconds, but 10
>> million strings takes over 2 minutes. It gets way worse with 50 million. I
>> suspect it has something to do with R's string hash table.
>>
>> set.seed(1000)
>> end <- sample(1e8, 1e6)
>> system.time(paste0("END", "=", end))
>>     user  system elapsed
>>    6.396   0.028   6.420
>>
>> end <- sample(1e8, 1e7)
>> system.time(paste0("END", "=", end))
>>     user  system elapsed
>> 134.714   0.352 134.978
>>
>> Indeed, even this takes a long time (in a fresh session):
>>
>> set.seed(1000)
>> end <- sample(1e8, 1e6)
>> end <- sample(1e8, 1e7)
>> system.time(as.character(end))
>>     user  system elapsed
>>   57.224   0.156  57.366
>>
>> But running it a second time is faster (about what one would expect?):
>>
>> system.time(levels <- as.character(end))
>>     user  system elapsed
>>   23.582   0.021  23.589
>>
>> I did some simple profiling of R to find that the resizing of the string
>> hash table is not a significant component of the time. So maybe something
>> to do with the R heap/gc? No time right now to go deeper. But I know Martin
>> likes this sort of thing ;)
>>
>> Michael
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel

-- 
Computational Biologist
Genentech Research
