[Bioc-devel] writeVcf performance

Valerie Obenchain vobencha at fhcrc.org
Tue Aug 5 23:33:33 CEST 2014


Hi Michael,

I'm interested in working on this. I'll discuss with Martin next week 
when we're both back in the office.

Val




On 08/05/14 07:46, Michael Lawrence wrote:
> Hi guys (Val, Martin, Herve):
>
> Anyone have an itch for optimization? The writeVcf function is currently a
> bottleneck in our WGS genotyping pipeline. For a typical 50 million row
> gVCF, it was taking 2.25 hours prior to yesterday's improvements
> (pasteCollapseRows) that brought it down to about 1 hour, which is still
> too long by my standards (> 0). It only takes 3 minutes to call the genotypes
> (and associated likelihoods etc.) from the variant calls (using 80 cores and
> 450 GB RAM on one node), so writing the output is the bottleneck. Profiling
> suggests that the running time scales non-linearly in the number of rows.
>
> Digging a little deeper, it seems to be something with R's string/memory
> allocation. Below, pasting 1 million strings takes 6 seconds, but 10
> million strings takes over 2 minutes. It gets way worse with 50 million. I
> suspect it has something to do with R's string hash table.
>
> set.seed(1000)
> end <- sample(1e8, 1e6)
> system.time(paste0("END", "=", end))
>     user  system elapsed
>    6.396   0.028   6.420
>
> end <- sample(1e8, 1e7)
> system.time(paste0("END", "=", end))
>     user  system elapsed
> 134.714   0.352 134.978
>
> Indeed, even this takes a long time (in a fresh session):
>
> set.seed(1000)
> end <- sample(1e8, 1e6)
> end <- sample(1e8, 1e7)
> system.time(as.character(end))
>     user  system elapsed
>   57.224   0.156  57.366
>
> But running it a second time is faster (about what one would expect?):
>
> system.time(levels <- as.character(end))
>     user  system elapsed
>   23.582   0.021  23.589
>
> I did some simple profiling of R to find that the resizing of the string
> hash table is not a significant component of the time. So maybe something
> to do with the R heap/gc? No time right now to go deeper. But I know Martin
> likes this sort of thing ;)
>
> Michael
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
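
A minimal sketch for checking the heap/GC hypothesis raised in the quoted
message. It times as.character() at a few sizes and uses base R's gc.time()
to estimate how much of the elapsed time went to garbage collection. The
helper name time_conversion and the chosen sizes are hypothetical, not from
the thread, and timings will vary by machine and R version.

```r
# Hypothetical helper (not from the thread): time as.character() on n random
# integers and record how much of that time was spent in garbage collection.
time_conversion <- function(n) {
  x <- sample(1e8, n)
  gc()                          # collect up front so earlier garbage is not counted
  gc_before <- gc.time()        # cumulative GC time: user, system, elapsed, ...
  timing <- system.time(as.character(x), gcFirst = FALSE)
  gc_spent <- gc.time() - gc_before
  c(n = n,
    elapsed = timing[["elapsed"]],
    gc_elapsed = gc_spent[[3]])
}

set.seed(1000)
sizes <- c(1e4, 1e5, 1e6)       # assumed sizes; raise toward 1e7 to mirror the thread
results <- t(sapply(sizes, time_conversion))
print(results)
```

If gc_elapsed grows much faster than n, that would point toward the
collector; if it stays a small share of elapsed, the cost probably lies
elsewhere.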
