[Bioc-devel] writeVcf performance

Tue Sep 2 22:33:25 CEST 2014

On 08/27/2014 11:56 AM, Gabe Becker wrote:
> The profiling I attached in my previous email is for 24 geno fields, as I said,
> but our typical usecase involves only ~4-6 fields, and is faster but still on
> the order of dozens of minutes.

I think Val is arriving at a (much) more efficient implementation, but...

I wanted to share my guess that the poor _scaling_ is because the garbage 
collector runs multiple times as the different strings are pasted together, and 
has to traverse, in linear time, increasing numbers of allocated SEXPs. So times 
scale approximately quadratically with the number of rows in the VCF

An efficiency is to reduce the number of SEXPs in play by writing out in chunks 
-- as each chunk is written, the SEXPs become available for collection and are 
re-used. Here's my toy example

time.R
======
splitIndices <- function (nx, ncl)
{
     i <- seq_len(nx)
     if (ncl == 0L)
         list()
     else if (ncl == 1L || nx == 1L)
         list(i)
     else {
         fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
         breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L)
         structure(split(i, cut(i, breaks, labels=FALSE)), names = NULL)
     }
}

x = as.character(seq_len(1e7)); y = sample(x)
if (!is.na(Sys.getenv("SPLIT", NA))) {
     idx <- splitIndices(length(x), 20)
     system.time(for (i in idx) paste(x[i], y[i], sep=":"))
} else {
     system.time(paste(x, y, sep=":"))
}

running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f time.R the 
relevant time is

    user  system elapsed
  15.320   0.064  15.381

versus with $ R --no-save --quiet -f time.R it is

    user  system elapsed
  95.360   0.164  95.511

I think this is likely an overall strategy when dealing with character data -- 
processing in independent chunks of moderate (1M?) size (enabling as a 
consequence parallel evaluation in modest memory) that are sufficient to benefit 
from vectorization, but that do not entail allocation of large numbers of in-use 
SEXPs.

Martin

>
> Sorry for the confusion.
> ~G
>
>
> On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker <beckerg4 at gene.com
> <mailto:beckerg4 at gene.com>> wrote:
>
>     Martin and Val.
>
>     I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields) with
>     profiling enabled. The results of summaryRprof for that run are attached,
>     though for a variety of reasons they are pretty misleading.
>
>     It took over an hour to write (3700+seconds), so it's definitely a
>     bottleneck when the data get very large, even if it isn't for smaller data.
>
>     Michael and I both think the culprit is all the pasting and cbinding that is
>     going on, and more to the point, that memory for an internal representation
>     to be written out is allocated at all.  Streaming across the object, looping
>     by rows and writing directly to file (e.g. from C) should be blisteringly
>     fast in comparison.
>
>     ~G
>
>
>     On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence <michafla at gene.com
>     <mailto:michafla at gene.com>> wrote:
>
>         Gabe is still testing/profiling, but we'll send something randomized
>         along eventually.
>
>
>         On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan <mtmorgan at fhcrc.org
>         <mailto:mtmorgan at fhcrc.org>> wrote:
>
>             I didn't see in the original thread a reproducible (simulated, I
>             guess) example, to be explicit about what the problem is??
>
>             Martin
>
>
>             On 08/26/2014 10:47 AM, Michael Lawrence wrote:
>
>                 My understanding is that the heap optimization provided marginal
>                 gains, and
>                 that we need to think harder about how to optimize the all of
>                 the string
>                 manipulation in writeVcf. We either need to reduce it or reduce its
>                 overhead (i.e., the CHARSXP allocation). Gabe is doing more tests.
>
>
>                 On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain
>                 <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>
>                 wrote:
>
>                     Hi Gabe,
>
>                     Martin responded, and so did Michael,
>
>                     https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html
>                     <https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html>
>
>                     It sounded like Michael was ok with working with/around heap
>                     initialization.
>
>                     Michael, is that right or should we still consider this on
>                     the table?
>
>
>                     Val
>
>
>                     On 08/26/2014 09:34 AM, Gabe Becker wrote:
>
>                         Val,
>
>                         Has there been any movement on this? This remains a
>                         substantial
>                         bottleneck for us when writing very large VCF files (e.g.
>                         variants+genotypes for whole genome NGS samples).
>
>                         I was able to see a ~25% speedup with 4 cores and  an
>                         "optimal" speedup
>                         of ~2x with 10-12 cores for a VCF with 500k rows  using
>                         a very naive
>                         parallelization strategy and no other changes. I suspect
>                         this could be
>                         improved on quite a bit, or possibly made irrelevant
>                         with judicious use
>                         of serial C code.
>
>                         Did you and Martin make any plans regarding optimizing
>                         writeVcf?
>
>                         Best
>                         ~G
>
>
>                         On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain
>                         <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
>                         <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>>
>                         wrote:
>
>                               Hi Michael,
>
>                               I'm interested in working on this. I'll discuss
>                         with Martin next
>                               week when we're both back in the office.
>
>                               Val
>
>
>
>
>
>                               On 08/05/14 07:46, Michael Lawrence wrote:
>
>                                   Hi guys (Val, Martin, Herve):
>
>                                   Anyone have an itch for optimization? The
>                         writeVcf function is
>                                   currently a
>                                   bottleneck in our WGS genotyping pipeline. For
>                         a typical 50
>                                   million row
>                                   gVCF, it was taking 2.25 hours prior to
>                         yesterday's improvements
>                                   (pasteCollapseRows) that brought it down to
>                         about 1 hour, which
>                                   is still
>                                   too long by my standards (> 0). Only takes 3
>                         minutes to call the
>                                   genotypes
>                                   (and associated likelihoods etc) from the
>                         variant calls (using
>                                   80 cores and
>                                   450 GB RAM on one node), so the output is an
>                         issue. Profiling
>                                   suggests that
>                                   the running time scales non-linearly in the
>                         number of rows.
>
>                                   Digging a little deeper, it seems to be
>                         something with R's
>                                   string/memory
>                                   allocation. Below, pasting 1 million strings
>                         takes 6 seconds, but
>                         10
>                                   million strings takes over 2 minutes. It gets
>                         way worse with 50
>                                   million. I
>                                   suspect it has something to do with R's string
>                         hash table.
>
>                                   set.seed(1000)
>                                   end <- sample(1e8, 1e6)
>                                   system.time(paste0("END", "=", end))
>                                        user  system elapsed
>                                       6.396   0.028   6.420
>
>                                   end <- sample(1e8, 1e7)
>                                   system.time(paste0("END", "=", end))
>                                        user  system elapsed
>                                   134.714   0.352 134.978
>
>                                   Indeed, even this takes a long time (in a
>                         fresh session):
>
>                                   set.seed(1000)
>                                   end <- sample(1e8, 1e6)
>                                   end <- sample(1e8, 1e7)
>                                   system.time(as.character(end))
>                                        user  system elapsed
>                                      57.224   0.156  57.366
>
>                                   But running it a second time is faster (about
>                         what one would
>                                   expect?):
>
>                                   system.time(levels <- as.character(end))
>                                        user  system elapsed
>                                      23.582   0.021  23.589
>
>                                   I did some simple profiling of R to find that
>                         the resizing of
>                                   the string
>                                   hash table is not a significant component of
>                         the time. So maybe
>                                   something
>                                   to do with the R heap/gc? No time right now to
>                         go deeper. But I
>                                   know Martin
>                                   likes this sort of thing ;)
>
>                                   Michael
>
>                                            [[alternative HTML version deleted]]
>
>
>                           ___________________________________________________
>                         Bioc-devel at r-project.org
>                         <mailto:Bioc-devel at r-project.org>
>                         <mailto:Bioc-devel at r-project.__org
>                         <mailto:Bioc-devel at r-project.org>>
>                                   mailing list
>                         https://stat.ethz.ch/mailman/____listinfo/bioc-devel
>                         <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
>
>                           <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>                         <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>
>
>                               ___________________________________________________
>                         Bioc-devel at r-project.org
>                         <mailto:Bioc-devel at r-project.org>
>                         <mailto:Bioc-devel at r-project.__org
>                         <mailto:Bioc-devel at r-project.org>> mailing
>                         list
>                         https://stat.ethz.ch/mailman/____listinfo/bioc-devel
>                         <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
>
>
>                           <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>                         <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>
>
>
>
>                         --
>                         Computational Biologist
>                         Genentech Research
>
>
>
>
>
>                          [[alternative HTML version deleted]]
>
>                 _________________________________________________
>                 Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>                 mailing list
>                 https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>                 <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>
>
>
>             --
>             Computational Biology / Fred Hutchinson Cancer Research Center
>             1100 Fairview Ave. N.
>             PO Box 19024 Seattle, WA 98109
>
>             Location: Arnold Building M1 B861
>             Phone: (206) 667-2793 <tel:%28206%29%20667-2793>
>
>
>
>
>
>     --
>     Computational Biologist
>     Genentech Research
>
>
>
>
> --
> Computational Biologist
> Genentech Research

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793