[Bioc-devel] writeVcf performance
Gabe Becker
becker.gabe at gene.com
Thu Sep 4 22:56:25 CEST 2014
Val and Martin,
Apologies for the delay.
We realized that the Illumina platinum genome VCF files make a good test
case, assuming you strip out all the info fields (info=NA when reading
into R).
ftp://platgene:G3n3s4me@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz took
~4.2 hrs to write out, and is about 1.5x the size of the files we are
actually dealing with (~50M ranges vs. our ~30M).
Looking forward to a new, vastly improved writeVcf :).
~G
On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence <lawrence.michael at gene.com>
wrote:
> Yes, it's very clear that the scaling is non-linear, and Gabe has been
> experimenting with a chunk-wise + parallel algorithm. Unfortunately there
> is some frustrating overhead with the parallelism. But I'm glad Val is
> arriving at something quicker.
>
> Michael
>
>
> On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>
>> On 08/27/2014 11:56 AM, Gabe Becker wrote:
>>
>>> The profiling I attached in my previous email is for 24 geno
>>> fields, as I said, but our typical use case involves only ~4-6
>>> fields and is faster, but still on the order of dozens of minutes.
>>>
>>
>> I think Val is arriving at a (much) more efficient implementation, but...
>>
>> I wanted to share my guess that the poor _scaling_ is because the
>> garbage collector runs multiple times as the different strings are
>> pasted together, and has to traverse, in linear time, increasing
>> numbers of allocated SEXPs. So times scale approximately
>> quadratically with the number of rows in the VCF.
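Martin's super-linear-scaling guess can be checked directly. A minimal sketch (illustrative, not from the thread; sizes and timings are machine-dependent):

```r
## Sketch: time paste() at increasing sizes. Under the gc hypothesis,
## elapsed time should grow faster than linearly in n, so the
## per-element cost (timing / n) should increase with n.
ns <- c(1e5, 2e5, 4e5)
timing <- sapply(ns, function(n) {
    x <- as.character(seq_len(n))
    y <- sample(x)
    system.time(paste(x, y, sep = ":"))[["elapsed"]]
})
```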
>>
>> One efficiency is to reduce the number of SEXPs in play by writing
>> out in chunks -- as each chunk is written, its SEXPs become available
>> for collection and are re-used. Here's my toy example:
>>
>> time.R
>> ======
>> splitIndices <- function (nx, ncl)
>> {
>>     i <- seq_len(nx)
>>     if (ncl == 0L)
>>         list()
>>     else if (ncl == 1L || nx == 1L)
>>         list(i)
>>     else {
>>         fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
>>         breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L)
>>         structure(split(i, cut(i, breaks, labels=FALSE)), names = NULL)
>>     }
>> }
>>
>> x = as.character(seq_len(1e7)); y = sample(x)
>> if (!is.na(Sys.getenv("SPLIT", NA))) {
>>     idx <- splitIndices(length(x), 20)
>>     system.time(for (i in idx) paste(x[i], y[i], sep=":"))
>> } else {
>>     system.time(paste(x, y, sep=":"))
>> }
>>
>>
>> running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f time.R the
>> relevant time is
>>
>>    user  system elapsed
>>  15.320   0.064  15.381
>>
>> versus with $ R --no-save --quiet -f time.R it is
>>
>>    user  system elapsed
>>  95.360   0.164  95.511
>>
>> I think this is likely a good overall strategy when dealing with
>> character data -- processing in independent chunks of moderate (1M?)
>> size (enabling, as a consequence, parallel evaluation in modest
>> memory) that are large enough to benefit from vectorization, but that
>> do not entail allocation of large numbers of in-use SEXPs.
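Applied to output, the chunk-wise idea might look like the following sketch. This is a hypothetical illustration, not the writeVcf implementation; `writeChunked` is an invented helper name:

```r
## Hypothetical sketch: paste and write one chunk at a time, so the
## pasted strings from earlier chunks can be garbage-collected instead
## of all being live at once.
writeChunked <- function(x, y, path, chunkSize = 5e4L) {
    con <- file(path, open = "wt")
    on.exit(close(con))
    starts <- seq(1L, length(x), by = chunkSize)
    for (s in starts) {
        i <- s:min(s + chunkSize - 1L, length(x))
        ## only one chunk's worth of pasted CHARSXPs is in play here
        writeLines(paste(x[i], y[i], sep = ":"), con)
    }
}

x <- as.character(seq_len(2e5))
y <- sample(x)
tf <- tempfile()
writeChunked(x, y, tf)
```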
>>
>> Martin
>>
>>
>>> Sorry for the confusion.
>>> ~G
>>>
>>>
>>> On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker
>>> <beckerg4 at gene.com> wrote:
>>>
>>> Martin and Val,
>>>
>>> I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno
>>> fields) with profiling enabled. The results of summaryRprof for
>>> that run are attached, though for a variety of reasons they are
>>> pretty misleading.
>>>
>>> It took over an hour to write (3700+ seconds), so it's definitely a
>>> bottleneck when the data get very large, even if it isn't for
>>> smaller data.
>>>
>>> Michael and I both think the culprit is all the pasting and
>>> cbinding that is going on and, more to the point, that memory for
>>> an internal representation to be written out is allocated at all.
>>> Streaming across the object, looping by rows and writing directly
>>> to file (e.g. from C) should be blisteringly fast in comparison.
>>>
>>> ~G
>>>
>>>
>>> On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence
>>> <michafla at gene.com> wrote:
>>>
>>> Gabe is still testing/profiling, but we'll send something
>>> randomized along eventually.
>>>
>>>
>>> On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan
>>> <mtmorgan at fhcrc.org> wrote:
>>>
>>> I didn't see in the original thread a reproducible (simulated, I
>>> guess) example -- can you be explicit about what the problem is?
>>>
>>> Martin
>>>
>>>
>>> On 08/26/2014 10:47 AM, Michael Lawrence wrote:
>>>
>>> My understanding is that the heap optimization provided marginal
>>> gains, and that we need to think harder about how to optimize all
>>> of the string manipulation in writeVcf. We either need to reduce it
>>> or reduce its overhead (i.e., the CHARSXP allocation). Gabe is
>>> doing more tests.
>>>
>>>
>>> On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain
>>> <vobencha at fhcrc.org> wrote:
>>>
>>> Hi Gabe,
>>>
>>> Martin responded, and so did Michael,
>>>
>>> https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html
>>>
>>> It sounded like Michael was ok with working with/around heap
>>> initialization. Michael, is that right or should we still consider
>>> this on the table?
>>>
>>> Val
>>>
>>>
>>> On 08/26/2014 09:34 AM, Gabe Becker wrote:
>>>
>>> Val,
>>>
>>> Has there been any movement on this? This remains a substantial
>>> bottleneck for us when writing very large VCF files (e.g.
>>> variants+genotypes for whole genome NGS samples).
>>>
>>> I was able to see a ~25% speedup with 4 cores and an "optimal"
>>> speedup of ~2x with 10-12 cores for a VCF with 500k rows using a
>>> very naive parallelization strategy and no other changes. I suspect
>>> this could be improved on quite a bit, or possibly made irrelevant
>>> with judicious use of serial C code.
>>>
>>> Did you and Martin make any plans regarding optimizing writeVcf?
>>>
>>> Best
>>> ~G
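The sort of naive chunk-parallel formatting Gabe describes might be sketched as follows. This is an assumption about the approach, not his actual code; the chunk count and `mc.cores` values are arbitrary, and the per-chunk formatter is a stand-in for writeVcf's real per-row formatting:

```r
library(parallel)

## Hypothetical sketch: split rows into contiguous chunks, format each
## chunk on a separate core, then concatenate in order for writing.
## (mclapply forks, so this runs serially on Windows.)
end <- sample(1e8, 5e5)
idx <- split(seq_along(end),
             cut(seq_along(end), 4L, labels = FALSE))
lines <- unlist(mclapply(idx,
                         function(i) paste0("END", "=", end[i]),
                         mc.cores = 2L),
                use.names = FALSE)
```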
>>>
>>>
>>> On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain
>>> <vobencha at fhcrc.org> wrote:
>>>
>>> Hi Michael,
>>>
>>> I'm interested in working on this. I'll discuss with Martin next
>>> week when we're both back in the office.
>>>
>>> Val
>>>
>>>
>>>
>>>
>>>
>>> On 08/05/14 07:46, Michael Lawrence wrote:
>>>
>>> Hi guys (Val, Martin, Herve):
>>>
>>> Anyone have an itch for optimization? The writeVcf function is
>>> currently a bottleneck in our WGS genotyping pipeline. For a
>>> typical 50 million row gVCF, it was taking 2.25 hours prior to
>>> yesterday's improvements (pasteCollapseRows) that brought it down
>>> to about 1 hour, which is still too long by my standards (> 0).
>>> Only takes 3 minutes to call the genotypes (and associated
>>> likelihoods etc) from the variant calls (using 80 cores and 450 GB
>>> RAM on one node), so the output is an issue. Profiling suggests
>>> that the running time scales non-linearly in the number of rows.
>>>
>>> Digging a little deeper, it seems to be something with R's
>>> string/memory allocation. Below, pasting 1 million strings takes 6
>>> seconds, but 10 million strings takes over 2 minutes. It gets way
>>> worse with 50 million. I suspect it has something to do with R's
>>> string hash table.
>>>
>>> set.seed(1000)
>>> end <- sample(1e8, 1e6)
>>> system.time(paste0("END", "=", end))
>>>    user  system elapsed
>>>   6.396   0.028   6.420
>>>
>>> end <- sample(1e8, 1e7)
>>> system.time(paste0("END", "=", end))
>>>    user  system elapsed
>>> 134.714   0.352 134.978
>>>
>>> Indeed, even this takes a long time (in a fresh session):
>>>
>>> set.seed(1000)
>>> end <- sample(1e8, 1e6)
>>> end <- sample(1e8, 1e7)
>>> system.time(as.character(end))
>>>    user  system elapsed
>>>  57.224   0.156  57.366
>>>
>>> But running it a second time is faster (about what one would
>>> expect?):
>>>
>>> system.time(levels <- as.character(end))
>>>    user  system elapsed
>>>  23.582   0.021  23.589
>>>
>>> I did some simple profiling of R to find that the resizing of the
>>> string hash table is not a significant component of the time. So
>>> maybe something to do with the R heap/gc? No time right now to go
>>> deeper. But I know Martin likes this sort of thing ;)
>>>
>>> Michael
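One cheap way to probe the heap/gc guess is to have R announce every collection while the paste runs. A diagnostic sketch (not from the thread), using base R's `gcinfo()`:

```r
## gcinfo(TRUE) makes R print a message on each garbage collection;
## counting those messages while the big paste runs shows how often
## the collector fires during string allocation.
gcinfo(TRUE)
end <- sample(1e8, 1e6)
out <- paste0("END", "=", end)
gcinfo(FALSE)
```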
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Computational Biologist
>>> Genentech Research
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Computational Biology / Fred Hutchinson Cancer Research
>>> Center
>>> 1100 Fairview Ave. N.
>>> PO Box 19024 Seattle, WA 98109
>>>
>>> Location: Arnold Building M1 B861
>>> Phone: (206) 667-2793
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>
>