[Bioc-devel] writeVcf performance

Michael Lawrence lawrence.michael at gene.com
Thu Aug 14 12:11:54 CEST 2014


I thought it might come down to the heap initialization. We'll work with
that.
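
A minimal sketch (not from the thread) of one way to check how much of the
as.character() time goes to garbage collection, using base R's gc.time().
The vector size mirrors the 1e6 example quoted below; the variable names are
illustrative only, and treating GC time as a proxy for heap-growth overhead
is an assumption.

```r
# Sketch: how much of the conversion time is spent in the garbage collector?
set.seed(1000)
end <- sample(1e8, 1e6)

invisible(gc())              # start from a freshly collected heap
gc_start <- gc.time()        # cumulative GC time so far (user, system, elapsed, ...)

# gcFirst = FALSE so system.time() does not trigger an extra collection
timing <- system.time(chr <- as.character(end), gcFirst = FALSE)

gc_spent <- gc.time() - gc_start   # GC time accrued during the conversion
cat(sprintf("elapsed: %.2f s, of which GC: %.2f s\n",
            timing[["elapsed"]], gc_spent[3]))
```

If the heap is the issue, one would expect the GC share to shrink when R is
started with the larger --min-vsize and --min-nsize values quoted below.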


On Wed, Aug 13, 2014 at 4:42 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:

> On 08/05/2014 07:46 AM, Michael Lawrence wrote:
>
>> Hi guys (Val, Martin, Herve):
>>
>> Anyone have an itch for optimization? The writeVcf function is currently a
>> bottleneck in our WGS genotyping pipeline. For a typical 50 million row
>> gVCF, it was taking 2.25 hours prior to yesterday's improvements
>> (pasteCollapseRows) that brought it down to about 1 hour, which is still
>> too long by my standards (> 0). Only takes 3 minutes to call the genotypes
>> (and associated likelihoods etc) from the variant calls (using 80 cores and
>> 450 GB RAM on one node), so the output is an issue. Profiling suggests that
>> the running time scales non-linearly in the number of rows.
>>
>> Digging a little deeper, it seems to be something with R's string/memory
>> allocation. Below, pasting 1 million strings takes 6 seconds, but 10
>> million strings takes over 2 minutes. It gets way worse with 50 million. I
>> suspect it has something to do with R's string hash table.
>>
>> set.seed(1000)
>> end <- sample(1e8, 1e6)
>> system.time(paste0("END", "=", end))
>>     user  system elapsed
>>    6.396   0.028   6.420
>>
>> end <- sample(1e8, 1e7)
>> system.time(paste0("END", "=", end))
>>     user  system elapsed
>> 134.714   0.352 134.978
>>
>> Indeed, even this takes a long time (in a fresh session):
>>
>> set.seed(1000)
>> end <- sample(1e8, 1e6)
>> end <- sample(1e8, 1e7)
>> system.time(as.character(end))
>>     user  system elapsed
>>   57.224   0.156  57.366
>>
>
> my usual trick is R --no-save --quiet --min-vsize=2048M --min-nsize=45M,
> which changes the example above from
>
> > system.time(as.character(end))
>    user  system elapsed
>  82.835   0.343  83.195
>
> to
>
> > system.time(as.character(end))
>    user  system elapsed
>   9.245   0.169   9.424
>
> but I think it's a one-time gain; I wonder what the writeVcf command is
> that you're running?
>
> Martin
>
>
>> But running it a second time is faster (about what one would expect?):
>>
>> system.time(levels <- as.character(end))
>>     user  system elapsed
>>   23.582   0.021  23.589
>>
>> I did some simple profiling of R to find that the resizing of the string
>> hash table is not a significant component of the time. So maybe something
>> to do with the R heap/gc? No time right now to go deeper. But I know Martin
>> likes this sort of thing ;)
>>
>> Michael
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>
