[Bioc-devel] writeVcf performance

Martin Morgan mtmorgan at fhcrc.org
Thu Aug 14 01:42:29 CEST 2014


On 08/05/2014 07:46 AM, Michael Lawrence wrote:
> Hi guys (Val, Martin, Herve):
>
> Anyone have an itch for optimization? The writeVcf function is currently a
> bottleneck in our WGS genotyping pipeline. For a typical 50 million row
> gVCF, it was taking 2.25 hours prior to yesterday's improvements
> (pasteCollapseRows) that brought it down to about 1 hour, which is still
> too long by my standards (> 0). Only takes 3 minutes to call the genotypes
> (and associated likelihoods etc) from the variant calls (using 80 cores and
> 450 GB RAM on one node), so the output is an issue. Profiling suggests that
> the running time scales non-linearly in the number of rows.
>
> Digging a little deeper, it seems to be something with R's string/memory
> allocation. Below, pasting 1 million strings takes 6 seconds, but 10
> million strings takes over 2 minutes. It gets way worse with 50 million. I
> suspect it has something to do with R's string hash table.
>
> set.seed(1000)
> end <- sample(1e8, 1e6)
> system.time(paste0("END", "=", end))
>     user  system elapsed
>    6.396   0.028   6.420
>
> end <- sample(1e8, 1e7)
> system.time(paste0("END", "=", end))
>     user  system elapsed
> 134.714   0.352 134.978
>
> Indeed, even this takes a long time (in a fresh session):
>
> set.seed(1000)
> end <- sample(1e8, 1e6)
> end <- sample(1e8, 1e7)
> system.time(as.character(end))
>     user  system elapsed
>   57.224   0.156  57.366

My usual trick is to start R with larger initial heap sizes,
R --no-save --quiet --min-vsize=2048M --min-nsize=45M, which changes the
timing of the example above from

 > system.time(as.character(end))
    user  system elapsed
  82.835   0.343  83.195

to

 > system.time(as.character(end))
    user  system elapsed
   9.245   0.169   9.424

but I think it's a one-time gain. What is the writeVcf command that you're
running?
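
One way to check whether the garbage collector accounts for the difference is
to compare the total elapsed time with the time R reports spending in gc. The
following is only a minimal sketch: time_with_gc is a hypothetical helper (not
part of any package), and it assumes a smaller vector than the one above so
that it finishes quickly.

```r
# Hypothetical helper (not from any package): evaluate an expression and
# report total elapsed time alongside the elapsed time spent in the GC.
time_with_gc <- function(expr) {
  gc.time(TRUE)                      # make sure GC timing is switched on
  gc()                               # collect now so the run starts clean
  gc_before <- gc.time()
  # expr is a promise; it is forced inside system.time(), so it is timed.
  # gcFirst = FALSE because we already collected above.
  total <- system.time(expr, gcFirst = FALSE)
  gc_after <- gc.time()
  c(elapsed = total[["elapsed"]],
    gc_elapsed = (gc_after - gc_before)[[3]])
}

set.seed(1000)
end <- sample(1e8, 1e6)              # assumption: 1e6 rather than 1e7 elements
res <- time_with_gc(as.character(end))
print(res)
```

If gc_elapsed is a large share of elapsed, larger --min-vsize and --min-nsize
values are likely to help; if it is small, the cost is probably elsewhere.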

Martin

>
> But running it a second time is faster (about what one would expect?):
>
> system.time(levels <- as.character(end))
>     user  system elapsed
>   23.582   0.021  23.589
>
> I did some simple profiling of R to find that the resizing of the string
> hash table is not a significant component of the time. So maybe something
> to do with the R heap/gc? No time right now to go deeper. But I know Martin
> likes this sort of thing ;)
>
> Michael


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


