[Bioc-devel] writeVcf performance
Gabe Becker
becker.gabe at gene.com
Mon Sep 8 17:43:15 CEST 2014
Val,
That is great. I'll check this out and test it on our end.
~G
On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain <vobencha at fhcrc.org>
wrote:
> The new writeVcf code is in 1.11.28.
>
> Using the Illumina file you suggested, geno fields only, writing now takes
> about 17 minutes.
>
> > hdr
> class: VCFHeader
> samples(1): NA12877
> meta(6): fileformat ApplyRecalibration ... reference source
> fixed(1): FILTER
> info(22): AC AF ... culprit set
> geno(8): GT GQX ... PL VF
>
> > param = ScanVcfParam(info=NA)
> > vcf = readVcf(fl, "", param=param)
> > dim(vcf)
> [1] 51612762 1
>
> > system.time(writeVcf(vcf, "out.vcf"))
>     user  system  elapsed
>  971.032   6.568 1004.593
>
> In 1.11.28, parsing of geno data was moved to C. If this didn't speed
> things up enough we were planning to implement 'chunking' through the VCF
> and/or move the parsing of info to C; however, it looks like geno was the
> bottleneck.
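>
> For reference, a rough sketch of the 'chunking' idea (not what 1.11.28
> does; it assumes the file is bgzipped and tabix-indexed, and writes one
> output file per chunk since writeVcf() emits a complete file per call):
>
>     library(VariantAnnotation)
>     tab <- TabixFile(fl, yieldSize = 1e6)   # ~1M records per chunk
>     param <- ScanVcfParam(info = NA)        # geno fields only
>     open(tab)
>     i <- 0L
>     repeat {
>         chunk <- readVcf(tab, "", param = param)
>         if (nrow(chunk) == 0L)
>             break
>         i <- i + 1L
>         writeVcf(chunk, sprintf("out_%03d.vcf", i))
>     }
>     close(tab)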
>
> I've tested a number of samples/fields combinations in files with >= 0.5
> million rows and the improvement over writeVcf() in release is ~90%.
>
> Valerie
>
>
>
>
> On 09/04/14 15:28, Valerie Obenchain wrote:
>
>> Thanks Gabe. I should have something for you on Monday.
>>
>> Val
>>
>>
>> On 09/04/2014 01:56 PM, Gabe Becker wrote:
>>
>>> Val and Martin,
>>>
>>> Apologies for the delay.
>>>
>>> We realized that the Illumina platinum genome VCF files make a good test
>>> case, assuming you strip out all the INFO fields (info=NA when reading
>>> the file into R).
>>>
>>> ftp://platgene:G3n3s4me@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz
>>> took ~4.2 hrs to write out, and is about 1.5x the size of the files we
>>> are actually dealing with (~50M ranges vs our ~30M).
>>>
>>> Looking forward to a new, vastly improved writeVcf :).
>>>
>>> ~G
>>>
>>>
>>> On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence
>>> <lawrence.michael at gene.com> wrote:
>>>
>>> Yes, it's very clear that the scaling is non-linear, and Gabe has
>>> been experimenting with a chunk-wise + parallel algorithm.
>>> Unfortunately there is some frustrating overhead with the
>>> parallelism. But I'm glad Val is arriving at something quicker.
>>>
>>> Michael
>>>
>>>
>>> On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan
>>> <mtmorgan at fhcrc.org> wrote:
>>>
>>> On 08/27/2014 11:56 AM, Gabe Becker wrote:
>>>
>>> The profiling I attached in my previous email is for 24 geno fields, as
>>> I said, but our typical use case involves only ~4-6 fields, and is
>>> faster but still on the order of dozens of minutes.
>>>
>>>
>>> I think Val is arriving at a (much) more efficient
>>> implementation, but...
>>>
>>> I wanted to share my guess that the poor _scaling_ is because the
>>> garbage collector runs multiple times as the different strings are
>>> pasted together, and has to traverse, in linear time, increasing
>>> numbers of allocated SEXPs. So times scale approximately quadratically
>>> with the number of rows in the VCF.
>>>
>>> One efficiency gain is to reduce the number of SEXPs in play by writing
>>> out in chunks -- as each chunk is written, the SEXPs become available
>>> for collection and are re-used. Here's my toy example
>>>
>>> time.R
>>> ======
>>> splitIndices <- function (nx, ncl)
>>> {
>>>     i <- seq_len(nx)
>>>     if (ncl == 0L)
>>>         list()
>>>     else if (ncl == 1L || nx == 1L)
>>>         list(i)
>>>     else {
>>>         fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
>>>         breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L)
>>>         structure(split(i, cut(i, breaks, labels=FALSE)), names = NULL)
>>>     }
>>> }
>>>
>>> x = as.character(seq_len(1e7)); y = sample(x)
>>> if (!is.na(Sys.getenv("SPLIT", NA))) {
>>>     idx <- splitIndices(length(x), 20)
>>>     system.time(for (i in idx) paste(x[i], y[i], sep=":"))
>>> } else {
>>>     system.time(paste(x, y, sep=":"))
>>> }
>>>
>>>
>>> Running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f time.R,
>>> the relevant time is
>>>
>>>        user  system elapsed
>>>      15.320   0.064  15.381
>>>
>>> versus with $ R --no-save --quiet -f time.R it is
>>>
>>>        user  system elapsed
>>>      95.360   0.164  95.511
>>>
>>> I think this is likely a good overall strategy when dealing with
>>> character data -- process in independent chunks of moderate (1M?) size
>>> (enabling, as a consequence, parallel evaluation in modest memory) that
>>> are large enough to benefit from vectorization but do not entail
>>> allocation of large numbers of in-use SEXPs.
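>>>
>>> A minimal sketch of that strategy applied to output, reusing the toy
>>> objects above (the paste() stands in for the real per-record
>>> formatting):
>>>
>>>     con <- file("out.txt", "wt")
>>>     ## ~1M elements per chunk; strings from finished chunks become
>>>     ## garbage and their memory can be re-used
>>>     idx <- splitIndices(length(x), ceiling(length(x) / 1e6))
>>>     for (i in idx)
>>>         writeLines(paste(x[i], y[i], sep=":"), con)
>>>     close(con)
>>>
>>> so only one chunk's worth of newly pasted strings is reachable at any
>>> time.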
>>>
>>> Martin
>>>
>>>
>>> Sorry for the confusion.
>>> ~G
>>>
>>>
>>> On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker
>>> <beckerg4 at gene.com> wrote:
>>>
>>> Martin and Val.
>>>
>>> I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields)
>>> with profiling enabled. The results of summaryRprof for that run are
>>> attached, though for a variety of reasons they are pretty misleading.
>>>
>>> It took over an hour to write (3700+ seconds), so it's definitely a
>>> bottleneck when the data get very large, even if it isn't for smaller
>>> data.
>>>
>>> Michael and I both think the culprit is all the pasting and cbinding
>>> that is going on and, more to the point, that memory for an internal
>>> representation to be written out is allocated at all. Streaming across
>>> the object, looping over rows and writing directly to file (e.g. from
>>> C), should be blisteringly fast in comparison.
>>>
>>> ~G
>>>
>>>
>>> On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence
>>> <michafla at gene.com> wrote:
>>>
>>> Gabe is still testing/profiling, but we'll send something randomized
>>> along eventually.
>>>
>>>
>>> On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan
>>> <mtmorgan at fhcrc.org> wrote:
>>>
>>> I didn't see in the original thread a reproducible (simulated, I
>>> guess) example, to be explicit about what the problem is?
>>>
>>> Martin
>>>
>>>
>>> On 08/26/2014 10:47 AM, Michael Lawrence wrote:
>>>
>>> My understanding is that the heap optimization provided marginal
>>> gains, and that we need to think harder about how to optimize all of
>>> the string manipulation in writeVcf. We either need to reduce it or
>>> reduce its overhead (i.e., the CHARSXP allocation). Gabe is doing more
>>> tests.
>>>
>>>
>>> On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain
>>> <vobencha at fhcrc.org> wrote:
>>>
>>> Hi Gabe,
>>>
>>> Martin responded, and so did Michael,
>>>
>>>
>>> https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html
>>>
>>> It sounded like Michael was ok with working with/around heap
>>> initialization.
>>>
>>> Michael, is that right or should we still consider this on the table?
>>>
>>>
>>> Val
>>>
>>>
>>> On 08/26/2014 09:34 AM, Gabe Becker wrote:
>>>
>>> Val,
>>>
>>> Has there been any movement on this? This remains a substantial
>>> bottleneck for us when writing very large VCF files (e.g.
>>> variants+genotypes for whole genome NGS samples).
>>>
>>> I was able to see a ~25% speedup with 4 cores and an "optimal" speedup
>>> of ~2x with 10-12 cores for a VCF with 500k rows using a very naive
>>> parallelization strategy and no other changes. I suspect this could be
>>> improved on quite a bit, or possibly made irrelevant with judicious use
>>> of serial C code.
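>>>
>>> Roughly along these lines (purely illustrative -- formatChunk() is a
>>> placeholder for the per-record string assembly writeVcf does
>>> internally, and header writing is omitted):
>>>
>>>     library(parallel)
>>>     rows <- seq_len(nrow(vcf))
>>>     chunks <- split(rows, cut(rows, 10, labels = FALSE))
>>>     ## formatChunk() is a hypothetical per-chunk formatter
>>>     txt <- mclapply(chunks, function(i) formatChunk(vcf[i, ]),
>>>                     mc.cores = 10)
>>>     writeLines(unlist(txt, use.names = FALSE), "out.vcf")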
>>>
>>> Did you and Martin make any plans regarding optimizing writeVcf?
>>>
>>> Best
>>> ~G
>>>
>>>
>>> On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain
>>> <vobencha at fhcrc.org> wrote:
>>>
>>> Hi Michael,
>>>
>>> I'm interested in working on this. I'll discuss with Martin next week
>>> when we're both back in the office.
>>>
>>> Val
>>>
>>>
>>> On 08/05/14 07:46, Michael Lawrence wrote:
>>>
>>> Hi guys (Val, Martin, Herve):
>>>
>>> Anyone have an itch for optimization? The writeVcf function is
>>> currently a bottleneck in our WGS genotyping pipeline. For a typical
>>> 50 million row gVCF, it was taking 2.25 hours prior to yesterday's
>>> improvements (pasteCollapseRows) that brought it down to about 1 hour,
>>> which is still too long by my standards (> 0). Only takes 3 minutes to
>>> call the genotypes (and associated likelihoods etc) from the variant
>>> calls (using 80 cores and 450 GB RAM on one node), so the output is an
>>> issue. Profiling suggests that the running time scales non-linearly in
>>> the number of rows.
>>>
>>> Digging a little deeper, it seems to be something with R's
>>> string/memory allocation. Below, pasting 1 million strings takes 6
>>> seconds, but 10 million strings takes over 2 minutes. It gets way
>>> worse with 50 million. I suspect it has something to do with R's
>>> string hash table.
>>>
>>>     set.seed(1000)
>>>     end <- sample(1e8, 1e6)
>>>     system.time(paste0("END", "=", end))
>>>        user  system elapsed
>>>       6.396   0.028   6.420
>>>
>>>     end <- sample(1e8, 1e7)
>>>     system.time(paste0("END", "=", end))
>>>        user  system elapsed
>>>     134.714   0.352 134.978
>>>
>>> Indeed, even this takes a long time (in a fresh session):
>>>
>>>     set.seed(1000)
>>>     end <- sample(1e8, 1e6)
>>>     end <- sample(1e8, 1e7)
>>>     system.time(as.character(end))
>>>        user  system elapsed
>>>      57.224   0.156  57.366
>>>
>>> But running it a second time is faster (about what one would expect?):
>>>
>>>     system.time(levels <- as.character(end))
>>>        user  system elapsed
>>>      23.582   0.021  23.589
>>>
>>> I did some simple profiling of R to find that the resizing of the
>>> string hash table is not a significant component of the time. So maybe
>>> something to do with the R heap/gc? No time right now to go deeper.
>>> But I know Martin likes this sort of thing ;)
>>>
>>> Michael
>>>
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
>
--
Computational Biologist
Genentech Research