[Bioc-devel] writeVcf performance

Mon Sep 8 21:00:37 CEST 2014

fyi Martin found a bug with the treatment of list data (ie, Number = 
'.') in the header. Working on a fix ...

Val

On 09/08/2014 08:43 AM, Gabe Becker wrote:
> Val,
>
> That is great. I'll check this out and test it on our end.
>
> ~G
>
> On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain <vobencha at fhcrc.org
> <mailto:vobencha at fhcrc.org>> wrote:
>
>     The new writeVcf code is in 1.11.28.
>
>     Using the illumina file you suggested, geno fields only, writing now
>     takes about 17 minutes.
>
>      > hdr
>     class: VCFHeader
>     samples(1): NA12877
>     meta(6): fileformat ApplyRecalibration ... reference source
>     fixed(1): FILTER
>     info(22): AC AF ... culprit set
>     geno(8): GT GQX ... PL VF
>
>      > param = ScanVcfParam(info=NA)
>      > vcf = readVcf(fl, "", param=param)
>      > dim(vcf)
>     [1] 51612762        1
>
>      > system.time(writeVcf(vcf, "out.vcf"))
>          user   system  elapsed
>       971.032    6.568 1004.593
>
>     In 1.11.28, parsing of geno data was moved to C. If this didn't
>     speed things up enough we were planning to implement 'chunking'
>     through the VCF and/or move the parsing of info to C, however, it
>     looks like geno was the bottleneck.
>
>     I've tested a number of samples/fields combinations in files with >=
>     .5 million rows and the improvement over writeVcf() in release is ~ 90%.
>
>     Valerie
>
>
>
>
>     On 09/04/14 15:28, Valerie Obenchain wrote:
>
>         Thanks Gabe. I should have something for you on Monday.
>
>         Val
>
>
>         On 09/04/2014 01:56 PM, Gabe Becker wrote:
>
>             Val and Martin,
>
>             Apologies for the delay.
>
>             We realized that the Illumina platinum genome vcf files make
>             a good test
>             case, assuming you strip out all the info (info=NA when
>             reading it into
>             R) stuff.
>
>             ftp://platgene:G3n3s4me@ussd-__ftp.illumina.com/NA12877_S1.__genome.vcf.gz
>             <ftp://platgene:G3n3s4me@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz>
>             took about ~4.2 hrs to write out, and is about 1.5x the size
>             of the
>             files we are actually dealing with (~50M ranges vs our ~30M).
>
>             Looking forward a new vastly improved writeVcf :).
>
>             ~G
>
>
>             On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence
>             <lawrence.michael at gene.com
>             <mailto:lawrence.michael at gene.com>
>             <mailto:lawrence.michael at gene.__com
>             <mailto:lawrence.michael at gene.com>>> wrote:
>
>                  Yes, it's very clear that the scaling is non-linear,
>             and Gabe has
>                  been experimenting with a chunk-wise + parallel algorithm.
>                  Unfortunately there is some frustrating overhead with the
>                  parallelism. But I'm glad Val is arriving at something
>             quicker.
>
>                  Michael
>
>
>                  On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan
>             <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
>                  <mailto:mtmorgan at fhcrc.org
>             <mailto:mtmorgan at fhcrc.org>>> wrote:
>
>                      On 08/27/2014 11:56 AM, Gabe Becker wrote:
>
>                          The profiling I attached in my previous email
>             is for 24 geno
>                          fields, as I said,
>                          but our typical usecase involves only ~4-6
>             fields, and is
>                          faster but still on
>                          the order of dozens of minutes.
>
>
>                      I think Val is arriving at a (much) more efficient
>                      implementation, but...
>
>                      I wanted to share my guess that the poor _scaling_
>             is because
>                      the garbage collector runs multiple times as the
>             different
>                      strings are pasted together, and has to traverse,
>             in linear
>                      time, increasing numbers of allocated SEXPs. So
>             times scale
>                      approximately quadratically with the number of rows
>             in the VCF
>
>                      An efficiency is to reduce the number of SEXPs in
>             play by
>                      writing out in chunks -- as each chunk is written,
>             the SEXPs
>                      become available for collection and are re-used.
>             Here's my toy
>                      example
>
>                      time.R
>                      ======
>                      splitIndices <- function (nx, ncl)
>                      {
>                           i <- seq_len(nx)
>                           if (ncl == 0L)
>                               list()
>                           else if (ncl == 1L || nx == 1L)
>                               list(i)
>                           else {
>                               fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
>                               breaks <- seq(1 - fuzz, nx + fuzz, length
>             = ncl + 1L)
>                               structure(split(i, cut(i, breaks,
>             labels=FALSE)), names
>                      = NULL)
>                           }
>                      }
>
>                      x = as.character(seq_len(1e7)); y = sample(x)
>                      if (!is.na <http://is.na>
>             <http://is.na>(Sys.getenv("__SPLIT", NA))) {
>                           idx <- splitIndices(length(x), 20)
>                           system.time(for (i in idx) paste(x[i], y[i],
>             sep=":"))
>                      } else {
>                           system.time(paste(x, y, sep=":"))
>                      }
>
>
>                      running under R-devel with $ SPLIT=TRUE R --no-save
>             --quiet -f
>                      time.R the relevant time is
>
>                          user  system elapsed
>                        15.320   0.064  15.381
>
>                      versus with $ R --no-save --quiet -f time.R it is
>
>                          user  system elapsed
>                        95.360   0.164  95.511
>
>                      I think this is likely an overall strategy when
>             dealing with
>                      character data -- processing in independent chunks
>             of moderate
>                      (1M?) size (enabling as a consequence parallel
>             evaluation in
>                      modest memory) that are sufficient to benefit from
>                      vectorization, but that do not entail allocation of
>             large
>                      numbers of in-use SEXPs.
>
>                      Martin
>
>
>                          Sorry for the confusion.
>                          ~G
>
>
>                          On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker
>                          <beckerg4 at gene.com <mailto:beckerg4 at gene.com>
>             <mailto:beckerg4 at gene.com <mailto:beckerg4 at gene.com>>
>                          <mailto:beckerg4 at gene.com
>             <mailto:beckerg4 at gene.com> <mailto:beckerg4 at gene.com
>             <mailto:beckerg4 at gene.com>>>> wrote:
>
>                               Martin and Val.
>
>                               I re-ran writeVcf on our (G)VCF data
>             (34790518 ranges,
>                          24 geno fields) with
>                               profiling enabled. The results of
>             summaryRprof for that
>                          run are attached,
>                               though for a variety of reasons they are
>             pretty
>             misleading.
>
>                               It took over an hour to write
>             (3700+seconds), so it's
>                          definitely a
>                               bottleneck when the data get very large,
>             even if it
>                          isn't for smaller data.
>
>                               Michael and I both think the culprit is
>             all the pasting
>                          and cbinding that is
>                               going on, and more to the point, that
>             memory for an
>                          internal representation
>                               to be written out is allocated at all.
>             Streaming
>                          across the object, looping
>                               by rows and writing directly to file (e.g.
>             from C)
>                          should be blisteringly
>                               fast in comparison.
>
>                               ~G
>
>
>                               On Tue, Aug 26, 2014 at 11:57 AM, Michael
>             Lawrence
>                          <michafla at gene.com <mailto:michafla at gene.com>
>             <mailto:michafla at gene.com <mailto:michafla at gene.com>>
>                               <mailto:michafla at gene.com
>             <mailto:michafla at gene.com> <mailto:michafla at gene.com
>             <mailto:michafla at gene.com>>>>
>                          wrote:
>
>                                   Gabe is still testing/profiling, but
>             we'll send
>                          something randomized
>                                   along eventually.
>
>
>                                   On Tue, Aug 26, 2014 at 11:15 AM,
>             Martin Morgan
>                          <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
>             <mailto:mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>>
>                                   <mailto:mtmorgan at fhcrc.org
>             <mailto:mtmorgan at fhcrc.org>
>                          <mailto:mtmorgan at fhcrc.org
>             <mailto:mtmorgan at fhcrc.org>>>> wrote:
>
>                                       I didn't see in the original thread a
>                          reproducible (simulated, I
>                                       guess) example, to be explicit
>             about what the
>                          problem is??
>
>                                       Martin
>
>
>                                       On 08/26/2014 10:47 AM, Michael
>             Lawrence wrote:
>
>                                           My understanding is that the heap
>                          optimization provided marginal
>                                           gains, and
>                                           that we need to think harder
>             about how to
>                          optimize the all of
>                                           the string
>                                           manipulation in writeVcf. We
>             either need to
>                          reduce it or reduce its
>                                           overhead (i.e., the CHARSXP
>             allocation).
>                          Gabe is doing more tests.
>
>
>                                           On Tue, Aug 26, 2014 at 9:43
>             AM, Valerie
>                          Obenchain
>                                           <vobencha at fhcrc.org
>             <mailto:vobencha at fhcrc.org>
>                          <mailto:vobencha at fhcrc.org
>             <mailto:vobencha at fhcrc.org>> <mailto:vobencha at fhcrc.org
>             <mailto:vobencha at fhcrc.org>
>                          <mailto:vobencha at fhcrc.org
>             <mailto:vobencha at fhcrc.org>>>>
>
>                                           wrote:
>
>                                               Hi Gabe,
>
>                                               Martin responded, and so
>             did Michael,
>
>
>             https://stat.ethz.ch/______pipermail/bioc-devel/2014-______August/006082.html
>             <https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html>
>
>             <https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html
>             <https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html>>
>
>
>
>             <https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html
>             <https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html>
>
>             <https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html
>             <https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html>>>
>
>                                               It sounded like Michael
>             was ok with
>                          working with/around heap
>                                               initialization.
>
>                                               Michael, is that right or
>             should we
>                          still consider this on
>                                               the table?
>
>
>                                               Val
>
>
>                                               On 08/26/2014 09:34 AM,
>             Gabe Becker
>             wrote:
>
>                                                   Val,
>
>                                                   Has there been any
>             movement on
>                          this? This remains a
>                                                   substantial
>                                                   bottleneck for us when
>             writing very
>                          large VCF files (e.g.
>                                                   variants+genotypes for
>             whole genome
>                          NGS samples).
>
>                                                   I was able to see a
>             ~25% speedup
>                          with 4 cores and  an
>                                                   "optimal" speedup
>                                                   of ~2x with 10-12
>             cores for a VCF
>                          with 500k rows  using
>                                                   a very naive
>                                                   parallelization
>             strategy and no
>                          other changes. I suspect
>                                                   this could be
>                                                   improved on quite a
>             bit, or
>                          possibly made irrelevant
>                                                   with judicious use
>                                                   of serial C code.
>
>                                                   Did you and Martin
>             make any plans
>                          regarding optimizing
>                                                   writeVcf?
>
>                                                   Best
>                                                   ~G
>
>
>                                                   On Tue, Aug 5, 2014 at
>             2:33 PM,
>                          Valerie Obenchain
>                                                   <vobencha at fhcrc.org
>             <mailto:vobencha at fhcrc.org>
>                          <mailto:vobencha at fhcrc.org
>             <mailto:vobencha at fhcrc.org>> <mailto:vobencha at fhcrc.org
>             <mailto:vobencha at fhcrc.org>
>                          <mailto:vobencha at fhcrc.org
>             <mailto:vobencha at fhcrc.org>>>
>
>               <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
>                          <mailto:vobencha at fhcrc.org
>             <mailto:vobencha at fhcrc.org>> <mailto:vobencha at fhcrc.org
>             <mailto:vobencha at fhcrc.org>
>                          <mailto:vobencha at fhcrc.org
>             <mailto:vobencha at fhcrc.org>>>>>
>
>                                                   wrote:
>
>                                                         Hi Michael,
>
>                                                         I'm interested
>             in working on
>                          this. I'll discuss
>                                                   with Martin next
>                                                         week when we're
>             both back in
>                          the office.
>
>                                                         Val
>
>
>
>
>
>                                                         On 08/05/14
>             07:46, Michael
>                          Lawrence wrote:
>
>                                                             Hi guys
>             (Val, Martin,
>             Herve):
>
>                                                             Anyone have
>             an itch for
>                          optimization? The
>                                                   writeVcf function is
>                                                             currently a
>                                                             bottleneck
>             in our WGS
>                          genotyping pipeline. For
>                                                   a typical 50
>                                                             million row
>                                                             gVCF, it was
>             taking 2.25
>                          hours prior to
>                                                   yesterday's improvements
>
>               (pasteCollapseRows) that
>                          brought it down to
>                                                   about 1 hour, which
>                                                             is still
>                                                             too long by
>             my standards
>                          (> 0). Only takes 3
>                                                   minutes to call the
>                                                             genotypes
>                                                             (and associated
>                          likelihoods etc) from the
>                                                   variant calls (using
>                                                             80 cores and
>                                                             450 GB RAM
>             on one node),
>                          so the output is an
>                                                   issue. Profiling
>                                                             suggests that
>                                                             the running
>             time scales
>                          non-linearly in the
>                                                   number of rows.
>
>                                                             Digging a
>             little deeper,
>                          it seems to be
>                                                   something with R's
>                                                             string/memory
>                                                             allocation.
>             Below,
>                          pasting 1 million strings
>                                                   takes 6 seconds, but
>                                                   10
>                                                             million
>             strings takes
>                          over 2 minutes. It gets
>                                                   way worse with 50
>                                                             million. I
>                                                             suspect it
>             has something
>                          to do with R's string
>                                                   hash table.
>
>                                                             set.seed(1000)
>                                                             end <-
>             sample(1e8, 1e6)
>
>               system.time(paste0("END",
>                          "=", end))
>                                                                  user
>             system elapsed
>                                                                 6.396
>               0.028   6.420
>
>                                                             end <-
>             sample(1e8, 1e7)
>
>               system.time(paste0("END",
>                          "=", end))
>                                                                  user
>             system elapsed
>                                                             134.714
>               0.352 134.978
>
>                                                             Indeed, even
>             this takes a
>                          long time (in a
>                                                   fresh session):
>
>                                                             set.seed(1000)
>                                                             end <-
>             sample(1e8, 1e6)
>                                                             end <-
>             sample(1e8, 1e7)
>
>                          system.time(as.character(end))
>                                                                  user
>             system elapsed
>                                                                57.224
>               0.156  57.366
>
>                                                             But running
>             it a second
>                          time is faster (about
>                                                   what one would
>                                                             expect?):
>
>
>               system.time(levels <-
>                          as.character(end))
>                                                                  user
>             system elapsed
>                                                                23.582
>               0.021  23.589
>
>                                                             I did some
>             simple
>                          profiling of R to find that
>                                                   the resizing of
>                                                             the string
>                                                             hash table
>             is not a
>                          significant component of
>                                                   the time. So maybe
>                                                             something
>                                                             to do with
>             the R heap/gc?
>                          No time right now to
>                                                   go deeper. But I
>                                                             know Martin
>                                                             likes this
>             sort of
>             thing ;)
>
>                                                             Michael
>
>
>             [[alternative
>                          HTML version deleted]]
>
>
>
>
>             _______________________________________________________
>             Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>             <mailto:Bioc-devel at r-project.__org
>             <mailto:Bioc-devel at r-project.org>>
>
>               <mailto:Bioc-devel at r-project.
>             <mailto:Bioc-devel at r-project.>____org
>                          <mailto:Bioc-devel at r-project.__org
>             <mailto:Bioc-devel at r-project.org>>>
>
>               <mailto:Bioc-devel at r-project <mailto:Bioc-devel at r-project>.
>                          <mailto:Bioc-devel at r-project
>             <mailto:Bioc-devel at r-project>.>______org
>
>               <mailto:Bioc-devel at r-project.
>             <mailto:Bioc-devel at r-project.>____org
>                          <mailto:Bioc-devel at r-project.__org
>             <mailto:Bioc-devel at r-project.org>>>>
>                                                             mailing list
>
>             https://stat.ethz.ch/mailman/________listinfo/bioc-devel
>             <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>
>
>             <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
>             <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>
>
>
>             <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
>             <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>
>
>             <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
>             <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>>
>
>
>
>             <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
>             <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>
>
>             <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
>             <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>
>
>
>             <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
>             <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
>
>             <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>             <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>>
>
>
>
>
>             _______________________________________________________
>             Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>             <mailto:Bioc-devel at r-project.__org
>             <mailto:Bioc-devel at r-project.org>>
>
>               <mailto:Bioc-devel at r-project.
>             <mailto:Bioc-devel at r-project.>____org
>                          <mailto:Bioc-devel at r-project.__org
>             <mailto:Bioc-devel at r-project.org>>>
>
>               <mailto:Bioc-devel at r-project <mailto:Bioc-devel at r-project>.
>                          <mailto:Bioc-devel at r-project
>             <mailto:Bioc-devel at r-project>.>______org
>
>               <mailto:Bioc-devel at r-project.
>             <mailto:Bioc-devel at r-project.>____org
>                          <mailto:Bioc-devel at r-project.__org
>             <mailto:Bioc-devel at r-project.org>>>> mailing
>                                                   list
>             https://stat.ethz.ch/mailman/________listinfo/bioc-devel
>             <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>
>
>             <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
>             <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>
>
>
>             <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
>             <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>
>
>             <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
>             <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>>
>
>
>
>
>
>             <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
>             <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>
>
>             <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
>             <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>
>
>
>             <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
>             <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
>
>             <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>             <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>>
>
>
>
>
>                                                   --
>                                                   Computational Biologist
>                                                   Genentech Research
>
>
>
>
>
>                                                    [[alternative HTML
>             version
>             deleted]]
>
>
>
>             _____________________________________________________
>             Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>             <mailto:Bioc-devel at r-project.__org
>             <mailto:Bioc-devel at r-project.org>>
>                          <mailto:Bioc-devel at r-project.
>             <mailto:Bioc-devel at r-project.>____org
>                          <mailto:Bioc-devel at r-project.__org
>             <mailto:Bioc-devel at r-project.org>>>
>                                           mailing list
>             https://stat.ethz.ch/mailman/______listinfo/bioc-devel
>             <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>
>
>             <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
>             <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>
>
>
>             <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
>             <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
>
>             <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>             <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>
>
>
>
>                                       --
>                                       Computational Biology / Fred
>             Hutchinson Cancer
>                          Research Center
>                                       1100 Fairview Ave. N.
>                                       PO Box 19024 Seattle, WA 98109
>
>                                       Location: Arnold Building M1 B861
>                                       Phone: (206) 667-2793
>             <tel:%28206%29%20667-2793>
>                          <tel:%28206%29%20667-2793>
>             <tel:%28206%29%20667-2793>
>
>
>
>
>
>
>                               --
>                               Computational Biologist
>                               Genentech Research
>
>
>
>
>                          --
>                          Computational Biologist
>                          Genentech Research
>
>
>
>                      --
>                      Computational Biology / Fred Hutchinson Cancer
>             Research Center
>                      1100 Fairview Ave. N.
>                      PO Box 19024 Seattle, WA 98109
>
>                      Location: Arnold Building M1 B861
>                      Phone: (206) 667-2793 <tel:%28206%29%20667-2793>
>             <tel:%28206%29%20667-2793>
>
>
>
>
>
>             --
>             Computational Biologist
>             Genentech Research
>
>
>         _________________________________________________
>         Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>         mailing list
>         https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>
>
>
>
>
> --
> Computational Biologist
> Genentech Research