[Bioc-devel] writeVcf performance
Valerie Obenchain
vobencha at fhcrc.org
Fri Sep 5 00:28:04 CEST 2014
Thanks Gabe. I should have something for you on Monday.
Val
On 09/04/2014 01:56 PM, Gabe Becker wrote:
> Val and Martin,
>
> Apologies for the delay.
>
> We realized that the Illumina platinum genome VCF files make a good test
> case, assuming you strip out all the INFO fields (info=NA when reading
> into R).
>
> ftp://platgene:G3n3s4me@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz
> took ~4.2 hrs to write out, and is about 1.5x the size of the files we
> are actually dealing with (~50M ranges vs. our ~30M).
>
> Looking forward to a new, vastly improved writeVcf :).
>
> ~G
>
>
> On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence
> <lawrence.michael at gene.com> wrote:
>
> Yes, it's very clear that the scaling is non-linear, and Gabe has been
> experimenting with a chunk-wise + parallel algorithm. Unfortunately
> there is some frustrating overhead with the parallelism. But I'm glad
> Val is arriving at something quicker.
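>
> A minimal sketch of the chunk-wise + parallel idea (my guess at its
> shape, not Gabe's actual code; mclapply() forks, so it is Unix-only,
> and splitIndices() is from the parallel package):
>
> library(parallel)
>
> x <- as.character(seq_len(1e7)); y <- sample(x)
> idx <- splitIndices(length(x), 20)
>
> ## paste each chunk on its own worker; fork start-up and result
> ## serialization are the kind of overhead mentioned above
> res <- mclapply(idx, function(i) paste(x[i], y[i], sep = ":"),
>                 mc.cores = 4)
> out <- unlist(res, use.names = FALSE)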
>
> Michael
>
>
> On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan
> <mtmorgan at fhcrc.org> wrote:
>
> On 08/27/2014 11:56 AM, Gabe Becker wrote:
>
> The profiling I attached in my previous email is for 24 geno fields,
> as I said, but our typical use case involves only ~4-6 fields, and is
> faster but still on the order of dozens of minutes.
>
>
> I think Val is arriving at a (much) more efficient implementation,
> but...
>
> I wanted to share my guess that the poor _scaling_ is because the
> garbage collector runs multiple times as the different strings are
> pasted together, and has to traverse, in linear time, increasing
> numbers of allocated SEXPs. So times scale approximately quadratically
> with the number of rows in the VCF.
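>
> A quick way to eyeball this guess (gcinfo() and gc.time() are
> standard R):
>
> x <- as.character(seq_len(1e6)); y <- sample(x)
> gcinfo(TRUE)                  # report each garbage collection
> invisible(paste(x, y, sep = ":"))
> gcinfo(FALSE)
> gc.time()                     # cumulative time spent in the collector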
>
> One efficiency gain is to reduce the number of SEXPs in play by
> writing out in chunks -- as each chunk is written, its SEXPs become
> available for collection and are re-used. Here's my toy example:
>
> time.R
> ======
> splitIndices <- function(nx, ncl)
> {
>     i <- seq_len(nx)
>     if (ncl == 0L)
>         list()
>     else if (ncl == 1L || nx == 1L)
>         list(i)
>     else {
>         fuzz <- min((nx - 1L) / 1000, 0.4 * nx / ncl)
>         breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L)
>         structure(split(i, cut(i, breaks, labels = FALSE)), names = NULL)
>     }
> }
>
> x <- as.character(seq_len(1e7)); y <- sample(x)
> if (!is.na(Sys.getenv("SPLIT", NA))) {
>     idx <- splitIndices(length(x), 20)
>     system.time(for (i in idx) paste(x[i], y[i], sep = ":"))
> } else {
>     system.time(paste(x, y, sep = ":"))
> }
>
>
> Running under R-devel with
>
>     $ SPLIT=TRUE R --no-save --quiet -f time.R
>
> the relevant time is
>
>    user  system elapsed
>  15.320   0.064  15.381
>
> versus, with $ R --no-save --quiet -f time.R, it is
>
>    user  system elapsed
>  95.360   0.164  95.511
>
> I think this is likely a good overall strategy when dealing with
> character data -- process in independent chunks of moderate (1M?)
> size, large enough to benefit from vectorization but not entailing
> allocation of large numbers of in-use SEXPs (as a consequence, also
> enabling parallel evaluation in modest memory).
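>
> Concretely, the output could be streamed chunk-wise through a
> standard connection; a sketch re-using splitIndices(), x and y from
> time.R above:
>
> con <- file("out.txt", "w")
> for (i in splitIndices(length(x), 10)) {
>     ## each chunk's strings become garbage once written, so the
>     ## collector can reclaim them before the next chunk is built
>     writeLines(paste(x[i], y[i], sep = ":"), con)
> }
> close(con)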
>
> Martin
>
> Sorry for the confusion.
> ~G
>
>
> On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker
> <beckerg4 at gene.com> wrote:
>
> Martin and Val,
>
> I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields)
> with profiling enabled. The results of summaryRprof for that run are
> attached, though for a variety of reasons they are pretty misleading.
>
> It took over an hour to write (3700+ seconds), so it's definitely a
> bottleneck when the data get very large, even if it isn't for smaller
> data.
>
> Michael and I both think the culprit is all the pasting and cbinding
> that is going on -- and, more to the point, that memory for an internal
> representation to be written out is allocated at all. Streaming across
> the object, looping over rows and writing directly to file (e.g. from
> C), should be blisteringly fast in comparison.
>
> ~G
>
>
> On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence
> <michafla at gene.com> wrote:
>
> Gabe is still testing/profiling, but we'll send something randomized
> along eventually.
>
>
> On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan
> <mtmorgan at fhcrc.org> wrote:
>
> I didn't see a reproducible (simulated, I guess) example in the
> original thread -- can we be explicit about what the problem is?
>
> Martin
>
>
> On 08/26/2014 10:47 AM, Michael Lawrence wrote:
>
> My understanding is that the heap optimization provided marginal
> gains, and that we need to think harder about how to optimize all of
> the string manipulation in writeVcf. We either need to reduce it, or
> reduce its overhead (i.e., the CHARSXP allocation). Gabe is doing more
> tests.
>
>
> On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain
> <vobencha at fhcrc.org> wrote:
>
> Hi Gabe,
>
> Martin responded, and so did Michael:
>
> https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html
>
> It sounded like Michael was OK with working with/around heap
> initialization.
>
> Michael, is that right, or should we still consider this on the table?
>
> Val
>
> On 08/26/2014 09:34 AM, Gabe Becker wrote:
>
> Val,
>
> Has there been any movement on this? This remains a substantial
> bottleneck for us when writing very large VCF files (e.g.
> variants+genotypes for whole-genome NGS samples).
>
> I was able to see a ~25% speedup with 4 cores, and an "optimal"
> speedup of ~2x with 10-12 cores, for a VCF with 500k rows, using a
> very naive parallelization strategy and no other changes. I suspect
> this could be improved on quite a bit, or possibly made irrelevant
> with judicious use of serial C code.
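>
> A sketch of what such a naive strategy might look like (illustrative
> only, not the actual code: writeVcf() is VariantAnnotation's, the
> chunking helpers are from parallel, 'vcf' is an in-memory VCF object,
> and re-assembling the chunk files -- stripping the repeated headers --
> is glossed over):
>
> library(parallel)
> library(VariantAnnotation)
>
> ## write row-chunks of 'vcf' to temporary files in parallel
> chunks <- splitIndices(nrow(vcf), 12)
> outs <- tempfile(fileext = rep(".vcf", length(chunks)))
> mcmapply(function(i, f) writeVcf(vcf[i, ], f),
>          chunks, outs, mc.cores = 12)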
>
> Did you and Martin make any plans regarding optimizing writeVcf?
>
> Best,
> ~G
>
>
> On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain
> <vobencha at fhcrc.org> wrote:
>
> Hi Michael,
>
> I'm interested in working on this. I'll discuss with Martin next week
> when we're both back in the office.
>
> Val
>
>
>
>
>
> On 08/05/14 07:46, Michael Lawrence wrote:
>
> Hi guys (Val, Martin, Herve):
>
> Anyone have an itch for optimization? The writeVcf function is
> currently a bottleneck in our WGS genotyping pipeline. For a typical
> 50 million row gVCF, it was taking 2.25 hours prior to yesterday's
> improvements (pasteCollapseRows), which brought it down to about 1
> hour -- still too long by my standards (> 0). It takes only 3 minutes
> to call the genotypes (and associated likelihoods etc.) from the
> variant calls (using 80 cores and 450 GB RAM on one node), so the
> output is an issue. Profiling suggests that the running time scales
> non-linearly with the number of rows.
>
> Digging a little deeper, it seems to be something with R's
> string/memory allocation. Below, pasting 1 million strings takes 6
> seconds, but 10 million strings takes over 2 minutes. It gets way
> worse with 50 million. I suspect it has something to do with R's
> string hash table.
>
> set.seed(1000)
> end <- sample(1e8, 1e6)
> system.time(paste0("END", "=", end))
>    user  system elapsed
>   6.396   0.028   6.420
>
> end <- sample(1e8, 1e7)
> system.time(paste0("END", "=", end))
>    user  system elapsed
> 134.714   0.352 134.978
>
> Indeed, even this takes a long time (in a fresh session):
>
> set.seed(1000)
> end <- sample(1e8, 1e6)
> end <- sample(1e8, 1e7)
> system.time(as.character(end))
>    user  system elapsed
>  57.224   0.156  57.366
>
> But running it a second time is faster (about what one would expect?):
>
> system.time(levels <- as.character(end))
>    user  system elapsed
>  23.582   0.021  23.589
>
> I did some simple profiling of R and found that resizing of the
> string hash table is not a significant component of the time. So
> maybe something to do with the R heap/gc? No time right now to go
> deeper, but I know Martin likes this sort of thing ;)
>
> Michael
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>
> --
> Computational Biologist
> Genentech Research