[Bioc-devel] writeVcf performance
Gabe Becker
becker.gabe at gene.com
Thu Sep 4 22:56:25 CEST 2014
Val and Martin,
Apologies for the delay.
We realized that the Illumina platinum genome VCF files make a good test
case, assuming you strip out all the info fields (info=NA when reading
into R).
ftp://platgene:G3n3s4me@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz took
~4.2 hrs to write out, and is about 1.5x the size of the files we are
actually dealing with (~50M ranges vs. our ~30M).
Looking forward to a new, vastly improved writeVcf :).
~G
On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence <lawrence.michael at gene.com>
wrote:
> Yes, it's very clear that the scaling is non-linear, and Gabe has been
> experimenting with a chunk-wise + parallel algorithm. Unfortunately there
> is some frustrating overhead with the parallelism. But I'm glad Val is
> arriving at something quicker.
>
> Michael
>
>
> On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>
>> On 08/27/2014 11:56 AM, Gabe Becker wrote:
>>
>>> The profiling I attached in my previous email is for 24 geno
>>> fields, as I said, but our typical use case involves only ~4-6
>>> fields and is faster, but still on the order of dozens of minutes.
>>>
>>
>> I think Val is arriving at a (much) more efficient implementation, but...
>>
>> I wanted to share my guess that the poor _scaling_ is because the
>> garbage collector runs multiple times as the different strings are
>> pasted together, and has to traverse, in linear time, increasing
>> numbers of allocated SEXPs. So times scale approximately
>> quadratically with the number of rows in the VCF.
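Martin's super-linear-scaling guess can be checked directly. A minimal sketch (illustrative, not from the thread; sizes and timings are machine-dependent):

```r
## Sketch: time paste() at increasing sizes. Under the gc hypothesis,
## elapsed time should grow faster than linearly in n, so the
## per-element cost (timing / n) should increase with n.
ns <- c(1e5, 2e5, 4e5)
timing <- sapply(ns, function(n) {
    x <- as.character(seq_len(n))
    y <- sample(x)
    system.time(paste(x, y, sep = ":"))[["elapsed"]]
})
```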
>>
>> One efficiency is to reduce the number of SEXPs in play by writing
>> out in chunks -- as each chunk is written, its SEXPs become available
>> for collection and are re-used. Here's my toy example:
>>
>> time.R
>> ======
>> splitIndices <- function (nx, ncl)
>> {
>>     i <- seq_len(nx)
>>     if (ncl == 0L)
>>         list()
>>     else if (ncl == 1L || nx == 1L)
>>         list(i)
>>     else {
>>         fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
>>         breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L)
>>         structure(split(i, cut(i, breaks, labels=FALSE)), names = NULL)
>>     }
>> }
>>
>> x = as.character(seq_len(1e7)); y = sample(x)
>> if (!is.na(Sys.getenv("SPLIT", NA))) {
>>     idx <- splitIndices(length(x), 20)
>>     system.time(for (i in idx) paste(x[i], y[i], sep=":"))
>> } else {
>>     system.time(paste(x, y, sep=":"))
>> }
>>
>>
>> running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f time.R the
>> relevant time is
>>
>>    user  system elapsed
>>  15.320   0.064  15.381
>>
>> versus with $ R --no-save --quiet -f time.R it is
>>
>>    user  system elapsed
>>  95.360   0.164  95.511
>>
>> I think this is likely a good overall strategy when dealing with
>> character data -- processing in independent chunks of moderate (1M?)
>> size (enabling, as a consequence, parallel evaluation in modest
>> memory) that are large enough to benefit from vectorization, but that
>> do not entail allocation of large numbers of in-use SEXPs.
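Applied to output, the chunk-wise idea might look like the following sketch. This is a hypothetical illustration, not the writeVcf implementation; `writeChunked` is an invented helper name:

```r
## Hypothetical sketch: paste and write one chunk at a time, so the
## pasted strings from earlier chunks can be garbage-collected instead
## of all being live at once.
writeChunked <- function(x, y, path, chunkSize = 5e4L) {
    con <- file(path, open = "wt")
    on.exit(close(con))
    starts <- seq(1L, length(x), by = chunkSize)
    for (s in starts) {
        i <- s:min(s + chunkSize - 1L, length(x))
        ## only one chunk's worth of pasted CHARSXPs is in play here
        writeLines(paste(x[i], y[i], sep = ":"), con)
    }
}

x <- as.character(seq_len(2e5))
y <- sample(x)
tf <- tempfile()
writeChunked(x, y, tf)
```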
>>
>> Martin
>>
>>
>>> Sorry for the confusion.
>>> ~G
>>>
>>>
>>> On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker
>>> <beckerg4 at gene.com> wrote:
>>>
>>> Martin and Val,
>>>
>>> I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno
>>> fields) with profiling enabled. The results of summaryRprof for
>>> that run are attached, though for a variety of reasons they are
>>> pretty misleading.
>>>
>>> It took over an hour to write (3700+ seconds), so it's definitely a
>>> bottleneck when the data get very large, even if it isn't for
>>> smaller data.
>>>
>>> Michael and I both think the culprit is all the pasting and
>>> cbinding that is going on and, more to the point, that memory for
>>> an internal representation to be written out is allocated at all.
>>> Streaming across the object, looping by rows and writing directly
>>> to file (e.g. from C) should be blisteringly fast in comparison.
>>>
>>> ~G
>>>
>>>
>>> On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence
>>> <michafla at gene.com> wrote:
>>>
>>> Gabe is still testing/profiling, but we'll send something
>>> randomized along eventually.
>>>
>>>
>>> On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan
>>> <mtmorgan at fhcrc.org> wrote:
>>>
>>> I didn't see in the original thread a reproducible (simulated, I
>>> guess) example -- can you be explicit about what the problem is?
>>>
>>> Martin
>>>
>>>
>>> On 08/26/2014 10:47 AM, Michael Lawrence wrote:
>>>
>>> My understanding is that the heap optimization provided marginal
>>> gains, and that we need to think harder about how to optimize all
>>> of the string manipulation in writeVcf. We either need to reduce it
>>> or reduce its overhead (i.e., the CHARSXP allocation). Gabe is
>>> doing more tests.
>>>
>>>
>>> On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain
>>> <vobencha at fhcrc.org> wrote:
>>>
>>> Hi Gabe,
>>>
>>> Martin responded, and so did Michael,
>>>
>>> https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html
>>>
>>> It sounded like Michael was ok with working with/around heap
>>> initialization. Michael, is that right or should we still consider
>>> this on the table?
>>>
>>> Val
>>>
>>>
>>> On 08/26/2014 09:34 AM, Gabe Becker wrote:
>>>
>>> Val,
>>>
>>> Has there been any movement on this? This remains a substantial
>>> bottleneck for us when writing very large VCF files (e.g.
>>> variants+genotypes for whole genome NGS samples).
>>>
>>> I was able to see a ~25% speedup with 4 cores and an "optimal"
>>> speedup of ~2x with 10-12 cores for a VCF with 500k rows using a
>>> very naive parallelization strategy and no other changes. I suspect
>>> this could be improved on quite a bit, or possibly made irrelevant
>>> with judicious use of serial C code.
>>>
>>> Did you and Martin make any plans regarding optimizing writeVcf?
>>>
>>> Best
>>> ~G
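The sort of naive chunk-parallel formatting Gabe describes might be sketched as follows. This is an assumption about the approach, not his actual code; the chunk count and `mc.cores` values are arbitrary, and the per-chunk formatter is a stand-in for writeVcf's real per-row formatting:

```r
library(parallel)

## Hypothetical sketch: split rows into contiguous chunks, format each
## chunk on a separate core, then concatenate in order for writing.
## (mclapply forks, so this runs serially on Windows.)
end <- sample(1e8, 5e5)
idx <- split(seq_along(end),
             cut(seq_along(end), 4L, labels = FALSE))
lines <- unlist(mclapply(idx,
                         function(i) paste0("END", "=", end[i]),
                         mc.cores = 2L),
                use.names = FALSE)
```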
>>>
>>>
>>> On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain
>>> <vobencha at fhcrc.org> wrote:
>>>
>>> Hi Michael,
>>>
>>> I'm interested in working on this. I'll discuss with Martin next
>>> week when we're both back in the office.
>>>
>>> Val
>>>
>>>
>>>
>>>
>>>
>>> On 08/05/14 07:46, Michael Lawrence wrote:
>>>
>>> Hi guys (Val, Martin, Herve):
>>>
>>> Anyone have an itch for optimization? The writeVcf function is
>>> currently a bottleneck in our WGS genotyping pipeline. For a
>>> typical 50 million row gVCF, it was taking 2.25 hours prior to
>>> yesterday's improvements (pasteCollapseRows) that brought it down
>>> to about 1 hour, which is still too long by my standards (> 0).
>>> Only takes 3 minutes to call the genotypes (and associated
>>> likelihoods etc) from the variant calls (using 80 cores and 450 GB
>>> RAM on one node), so the output is an issue. Profiling suggests
>>> that the running time scales non-linearly in the number of rows.
>>>
>>> Digging a little deeper, it seems to be something with R's
>>> string/memory allocation. Below, pasting 1 million strings takes 6
>>> seconds, but 10 million strings takes over 2 minutes. It gets way
>>> worse with 50 million. I suspect it has something to do with R's
>>> string hash table.
>>>
>>> set.seed(1000)
>>> end <- sample(1e8, 1e6)
>>> system.time(paste0("END", "=", end))
>>>    user  system elapsed
>>>   6.396   0.028   6.420
>>>
>>> end <- sample(1e8, 1e7)
>>> system.time(paste0("END", "=", end))
>>>    user  system elapsed
>>> 134.714   0.352 134.978
>>>
>>> Indeed, even this takes a long time (in a fresh session):
>>>
>>> set.seed(1000)
>>> end <- sample(1e8, 1e6)
>>> end <- sample(1e8, 1e7)
>>> system.time(as.character(end))
>>>    user  system elapsed
>>>  57.224   0.156  57.366
>>>
>>> But running it a second time is faster (about what one would
>>> expect?):
>>>
>>> system.time(levels <- as.character(end))
>>>    user  system elapsed
>>>  23.582   0.021  23.589
>>>
>>> I did some simple profiling of R to find that the resizing of the
>>> string hash table is not a significant component of the time. So
>>> maybe something to do with the R heap/gc? No time right now to go
>>> deeper. But I know Martin likes this sort of thing ;)
>>>
>>> Michael
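One cheap way to probe the heap/gc guess is to have R announce every collection while the paste runs. A diagnostic sketch (not from the thread), using base R's `gcinfo()`:

```r
## gcinfo(TRUE) makes R print a message on each garbage collection;
## counting those messages while the big paste runs shows how often
## the collector fires during string allocation.
gcinfo(TRUE)
end <- sample(1e8, 1e6)
out <- paste0("END", "=", end)
gcinfo(FALSE)
```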
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Computational Biologist
>>> Genentech Research
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Computational Biology / Fred Hutchinson Cancer Research
>>> Center
>>> 1100 Fairview Ave. N.
>>> PO Box 19024 Seattle, WA 98109
>>>
>>> Location: Arnold Building M1 B861
>>> Phone: (206) 667-2793
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>
>