[Bioc-devel] writeVcf performance

Gabe Becker becker.gabe at gene.com
Mon Sep 8 17:43:15 CEST 2014


Val,

That is great. I'll check this out and test it on our end.

~G

On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain <vobencha at fhcrc.org>
wrote:

> The new writeVcf code is in 1.11.28.
>
> Using the Illumina file you suggested, geno fields only, writing now
> takes about 17 minutes.
>
> > hdr
> class: VCFHeader
> samples(1): NA12877
> meta(6): fileformat ApplyRecalibration ... reference source
> fixed(1): FILTER
> info(22): AC AF ... culprit set
> geno(8): GT GQX ... PL VF
>
> > param = ScanVcfParam(info=NA)
> > vcf = readVcf(fl, "", param=param)
> > dim(vcf)
> [1] 51612762        1
>
> > system.time(writeVcf(vcf, "out.vcf"))
>     user   system  elapsed
>  971.032    6.568 1004.593
>
> In 1.11.28, parsing of geno data was moved to C. If this hadn't sped
> things up enough, we were planning to implement 'chunking' through the
> VCF and/or move the parsing of info to C; however, it looks like geno
> was the bottleneck.
>
> I've tested a number of sample/field combinations in files with >= 0.5
> million rows, and the improvement over writeVcf() in release is ~90%.
>
> Valerie
>
>
>
>
> On 09/04/14 15:28, Valerie Obenchain wrote:
>
>> Thanks Gabe. I should have something for you on Monday.
>>
>> Val
>>
>>
>> On 09/04/2014 01:56 PM, Gabe Becker wrote:
>>
>>> Val and Martin,
>>>
>>> Apologies for the delay.
>>>
>>> We realized that the Illumina platinum genome vcf files make a good
>>> test case, assuming you strip out all the info fields (info=NA when
>>> reading into R).
>>>
>>> ftp://platgene:G3n3s4me@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz
>>> took about 4.2 hrs to write out, and is about 1.5x the size of the
>>> files we are actually dealing with (~50M ranges vs our ~30M).
>>>
>>> Looking forward to a new, vastly improved writeVcf :).
>>>
>>> ~G
>>>
>>>
>>> On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence
>>> <lawrence.michael at gene.com> wrote:
>>>
>>>     Yes, it's very clear that the scaling is non-linear, and Gabe has
>>>     been experimenting with a chunk-wise + parallel algorithm.
>>>     Unfortunately there is some frustrating overhead with the
>>>     parallelism. But I'm glad Val is arriving at something quicker.
>>>
>>>     Michael
>>>
>>>
>>>     On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan
>>>     <mtmorgan at fhcrc.org> wrote:
>>>
>>>         On 08/27/2014 11:56 AM, Gabe Becker wrote:
>>>
>>>             The profiling I attached in my previous email is for 24
>>>             geno fields, as I said, but our typical use case involves
>>>             only ~4-6 fields, and is faster but still on the order of
>>>             dozens of minutes.
>>>
>>>
>>>         I think Val is arriving at a (much) more efficient
>>>         implementation, but...
>>>
>>>         I wanted to share my guess that the poor _scaling_ is because
>>>         the garbage collector runs multiple times as the different
>>>         strings are pasted together, and has to traverse, in linear
>>>         time, increasing numbers of allocated SEXPs. So times scale
>>>         approximately quadratically with the number of rows in the VCF.
>>>
>>>         One way to gain efficiency is to reduce the number of SEXPs in
>>>         play by writing out in chunks -- as each chunk is written, the
>>>         SEXPs become available for collection and are re-used. Here's
>>>         my toy example:
>>>
>>>         time.R
>>>         ======
>>>         time.R
>>>         ======
>>>         splitIndices <- function (nx, ncl)
>>>         {
>>>              i <- seq_len(nx)
>>>              if (ncl == 0L)
>>>                  list()
>>>              else if (ncl == 1L || nx == 1L)
>>>                  list(i)
>>>              else {
>>>                  fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
>>>                  breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L)
>>>                  structure(split(i, cut(i, breaks, labels=FALSE)),
>>>                            names = NULL)
>>>              }
>>>         }
>>>
>>>         x = as.character(seq_len(1e7)); y = sample(x)
>>>         if (!is.na(Sys.getenv("SPLIT", NA))) {
>>>              idx <- splitIndices(length(x), 20)
>>>              system.time(for (i in idx) paste(x[i], y[i], sep=":"))
>>>         } else {
>>>              system.time(paste(x, y, sep=":"))
>>>         }
>>>
>>>
>>>         Running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f
>>>         time.R, the relevant time is
>>>
>>>             user  system elapsed
>>>           15.320   0.064  15.381
>>>
>>>         versus with $ R --no-save --quiet -f time.R, it is
>>>
>>>             user  system elapsed
>>>           95.360   0.164  95.511
>>>
>>>         I think this is likely a good overall strategy when dealing
>>>         with character data -- processing in independent chunks of
>>>         moderate (1M?) size (enabling, as a consequence, parallel
>>>         evaluation in modest memory) that are sufficient to benefit
>>>         from vectorization, but that do not entail allocation of large
>>>         numbers of in-use SEXPs.
>>>
>>>         Martin
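[Editor's note: the chunk-wise strategy Martin describes could be sketched as below. This is illustrative code, not from the thread or from VariantAnnotation; the chunkedPaste name, the 1e6 chunk size, and the use of parallel::mclapply are assumptions added here.]

```r
## Illustrative sketch (not thread code): process character data in
## moderate-sized chunks so temporary SEXPs from each finished chunk
## can be garbage-collected and reused; chunks can also be evaluated
## in parallel. The 'parallel' package ships with base R (>= 2.14).
library(parallel)

chunkedPaste <- function(x, y, chunkSize = 1e6L, cores = 1L) {
    ## contiguous index chunks of at most chunkSize elements;
    ## split() orders numeric grouping values numerically
    idx <- split(seq_along(x), ceiling(seq_along(x) / chunkSize))
    ## paste each chunk independently; mc.cores = 1 is portable
    ## (forking is unavailable on Windows)
    res <- mclapply(idx, function(i) paste(x[i], y[i], sep = ":"),
                    mc.cores = cores)
    unlist(res, use.names = FALSE)
}

x <- as.character(seq_len(2e6)); y <- sample(x)
stopifnot(identical(chunkedPaste(x, y), paste(x, y, sep = ":")))
```

The chunk size is the knob: large enough to benefit from vectorized paste(), small enough that each chunk's allocations stay modest and become collectible as soon as the chunk is done.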
>>>
>>>
>>>             Sorry for the confusion.
>>>             ~G
>>>
>>>
>>>             On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker
>>>             <beckerg4 at gene.com> wrote:
>>>
>>>                  Martin and Val.
>>>
>>>                  I re-ran writeVcf on our (G)VCF data (34790518 ranges,
>>>                  24 geno fields) with profiling enabled. The results of
>>>                  summaryRprof for that run are attached, though for a
>>>                  variety of reasons they are pretty misleading.
>>>
>>>                  It took over an hour to write (3700+seconds), so it's
>>>             definitely a
>>>                  bottleneck when the data get very large, even if it
>>>             isn't for smaller data.
>>>
>>>                  Michael and I both think the culprit is all the pasting
>>>             and cbinding that is
>>>                  going on, and more to the point, that memory for an
>>>             internal representation
>>>                  to be written out is allocated at all.  Streaming
>>>             across the object, looping
>>>                  by rows and writing directly to file (e.g. from C)
>>>             should be blisteringly
>>>                  fast in comparison.
>>>
>>>                  ~G
>>>
>>>
>>>                  On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence
>>>                  <michafla at gene.com> wrote:
>>>
>>>                      Gabe is still testing/profiling, but we'll send
>>>             something randomized
>>>                      along eventually.
>>>
>>>
>>>                      On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan
>>>             <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
>>>                      <mailto:mtmorgan at fhcrc.org
>>>             <mailto:mtmorgan at fhcrc.org>>> wrote:
>>>
>>>                          I didn't see in the original thread a
>>>                          reproducible (simulated, I guess) example, to
>>>                          be explicit about what the problem is??
>>>
>>>                          Martin
>>>
>>>
>>>                          On 08/26/2014 10:47 AM, Michael Lawrence wrote:
>>>
>>>                              My understanding is that the heap
>>>                              optimization provided marginal gains, and
>>>                              that we need to think harder about how to
>>>                              optimize all of the string manipulation in
>>>                              writeVcf. We either need to reduce it or
>>>                              reduce its overhead (i.e., the CHARSXP
>>>                              allocation). Gabe is doing more tests.
>>>
>>>
>>>                              On Tue, Aug 26, 2014 at 9:43 AM, Valerie
>>>             Obenchain
>>>                              <vobencha at fhcrc.org
>>>             <mailto:vobencha at fhcrc.org> <mailto:vobencha at fhcrc.org
>>>             <mailto:vobencha at fhcrc.org>>>
>>>
>>>                              wrote:
>>>
>>>                                  Hi Gabe,
>>>
>>>                                  Martin responded, and so did Michael,
>>>
>>>
>>> https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html
>>>
>>>                                  It sounded like Michael was ok with
>>>             working with/around heap
>>>                                  initialization.
>>>
>>>                                  Michael, is that right or should we
>>>             still consider this on
>>>                                  the table?
>>>
>>>
>>>                                  Val
>>>
>>>
>>>                                  On 08/26/2014 09:34 AM, Gabe Becker
>>> wrote:
>>>
>>>                                      Val,
>>>
>>>                                      Has there been any movement on
>>>             this? This remains a
>>>                                      substantial
>>>                                      bottleneck for us when writing very
>>>             large VCF files (e.g.
>>>                                      variants+genotypes for whole genome
>>>             NGS samples).
>>>
>>>                                      I was able to see a ~25% speedup
>>>                                      with 4 cores and an "optimal"
>>>                                      speedup of ~2x with 10-12 cores
>>>                                      for a VCF with 500k rows, using a
>>>                                      very naive parallelization
>>>                                      strategy and no other changes. I
>>>                                      suspect this could be improved on
>>>                                      quite a bit, or possibly made
>>>                                      irrelevant with judicious use of
>>>                                      serial C code.
>>>
>>>                                      Did you and Martin make any plans
>>>             regarding optimizing
>>>                                      writeVcf?
>>>
>>>                                      Best
>>>                                      ~G
>>>
>>>
>>>                                      On Tue, Aug 5, 2014 at 2:33 PM,
>>>                                      Valerie Obenchain
>>>                                      <vobencha at fhcrc.org> wrote:
>>>
>>>                                            Hi Michael,
>>>
>>>                                            I'm interested in working on
>>>             this. I'll discuss
>>>                                      with Martin next
>>>                                            week when we're both back in
>>>             the office.
>>>
>>>                                            Val
>>>
>>>
>>>
>>>
>>>
>>>                                            On 08/05/14 07:46, Michael
>>>             Lawrence wrote:
>>>
>>>                                                Hi guys (Val, Martin,
>>> Herve):
>>>
>>>                                                Anyone have an itch for
>>>             optimization? The
>>>                                      writeVcf function is
>>>                                                currently a
>>>                                                bottleneck in our WGS
>>>             genotyping pipeline. For
>>>                                      a typical 50
>>>                                                million row
>>>                                                gVCF, it was taking 2.25
>>>             hours prior to
>>>                                      yesterday's improvements
>>>                                                (pasteCollapseRows) that
>>>             brought it down to
>>>                                      about 1 hour, which
>>>                                                is still
>>>                                                too long by my standards
>>>             (> 0). Only takes 3
>>>                                      minutes to call the
>>>                                                genotypes
>>>                                                (and associated
>>>             likelihoods etc) from the
>>>                                      variant calls (using
>>>                                                80 cores and
>>>                                                450 GB RAM on one node),
>>>             so the output is an
>>>                                      issue. Profiling
>>>                                                suggests that
>>>                                                the running time scales
>>>             non-linearly in the
>>>                                      number of rows.
>>>
>>>                                                Digging a little deeper,
>>>             it seems to be
>>>                                      something with R's
>>>                                                string/memory
>>>                                                allocation. Below,
>>>             pasting 1 million strings
>>>                                      takes 6 seconds, but
>>>                                      10
>>>                                                million strings takes
>>>             over 2 minutes. It gets
>>>                                      way worse with 50
>>>                                                million. I
>>>                                                suspect it has something
>>>             to do with R's string
>>>                                      hash table.
>>>
>>>                                                set.seed(1000)
>>>                                                end <- sample(1e8, 1e6)
>>>                                                system.time(paste0("END", "=", end))
>>>                                                     user  system elapsed
>>>                                                    6.396   0.028   6.420
>>>
>>>                                                end <- sample(1e8, 1e7)
>>>                                                system.time(paste0("END", "=", end))
>>>                                                     user  system elapsed
>>>                                                134.714   0.352 134.978
>>>
>>>                                                Indeed, even this takes a
>>>             long time (in a
>>>                                      fresh session):
>>>
>>>                                                set.seed(1000)
>>>                                                end <- sample(1e8, 1e6)
>>>                                                end <- sample(1e8, 1e7)
>>>                                                system.time(as.character(end))
>>>                                                     user  system elapsed
>>>                                                   57.224   0.156  57.366
>>>
>>>                                                But running it a second
>>>             time is faster (about
>>>                                      what one would
>>>                                                expect?):
>>>
>>>                                                system.time(levels <- as.character(end))
>>>                                                     user  system elapsed
>>>                                                   23.582   0.021  23.589
>>>
>>>                                                I did some simple
>>>             profiling of R to find that
>>>                                      the resizing of
>>>                                                the string
>>>                                                hash table is not a
>>>             significant component of
>>>                                      the time. So maybe
>>>                                                something
>>>                                                to do with the R heap/gc?
>>>             No time right now to
>>>                                      go deeper. But I
>>>                                                know Martin
>>>                                                likes this sort of
>>> thing ;)
>>>
>>>                                                Michael
>>>
>>>
>>>
>>>
>>>             _______________________________________________
>>>             Bioc-devel at r-project.org mailing list
>>>             https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>                                      --
>>>                                      Computational Biologist
>>>                                      Genentech Research
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>                          --
>>>                          Computational Biology / Fred Hutchinson Cancer
>>>             Research Center
>>>                          1100 Fairview Ave. N.
>>>                          PO Box 19024 Seattle, WA 98109
>>>
>>>                          Location: Arnold Building M1 B861
>>>                          Phone: (206) 667-2793
>>>
>>>
>>>
>>>
>>>
>>>
>>>                  --
>>>                  Computational Biologist
>>>                  Genentech Research
>>>
>>>
>>>
>>>
>>>             --
>>>             Computational Biologist
>>>             Genentech Research
>>>
>>>
>>>
>>>         --
>>>         Computational Biology / Fred Hutchinson Cancer Research Center
>>>         1100 Fairview Ave. N.
>>>         PO Box 19024 Seattle, WA 98109
>>>
>>>         Location: Arnold Building M1 B861
>>>         Phone: (206) 667-2793
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Computational Biologist
>>> Genentech Research
>>>
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
>


-- 
Computational Biologist
Genentech Research



