[Bioc-devel] writeVcf performance
Valerie Obenchain
vobencha at fhcrc.org
Fri Sep 5 00:28:04 CEST 2014
Thanks Gabe. I should have something for you on Monday.
Val
On 09/04/2014 01:56 PM, Gabe Becker wrote:
> Val and Martin,
>
> Apologies for the delay.
>
> We realized that the Illumina platinum genome VCF files make a good test
> case, assuming you strip out all the INFO fields (info=NA when reading
> into R).
>
> ftp://platgene:G3n3s4me@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz
> took ~4.2 hrs to write out, and is about 1.5x the size of the files we
> are actually dealing with (~50M ranges vs. our ~30M).
>
> Looking forward to a new, vastly improved writeVcf :).
>
> ~G
>
>
> On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence
> <lawrence.michael at gene.com> wrote:
>
> Yes, it's very clear that the scaling is non-linear, and Gabe has been
> experimenting with a chunk-wise + parallel algorithm. Unfortunately
> there is some frustrating overhead with the parallelism. But I'm glad
> Val is arriving at something quicker.
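>
> A minimal sketch of the chunk-wise + parallel idea (my guess at its
> shape, not Gabe's actual code; mclapply() forks, so it is Unix-only,
> and splitIndices() is from the parallel package):
>
> library(parallel)
>
> x <- as.character(seq_len(1e7)); y <- sample(x)
> idx <- splitIndices(length(x), 20)
>
> ## paste each chunk on its own worker; fork start-up and result
> ## serialization are the kind of overhead mentioned above
> res <- mclapply(idx, function(i) paste(x[i], y[i], sep = ":"),
>                 mc.cores = 4)
> out <- unlist(res, use.names = FALSE)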
>
> Michael
>
>
> On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan
> <mtmorgan at fhcrc.org> wrote:
>
> On 08/27/2014 11:56 AM, Gabe Becker wrote:
>
> The profiling I attached in my previous email is for 24 geno fields,
> as I said, but our typical use case involves only ~4-6 fields, and is
> faster but still on the order of dozens of minutes.
>
>
> I think Val is arriving at a (much) more efficient implementation,
> but...
>
> I wanted to share my guess that the poor _scaling_ is because the
> garbage collector runs multiple times as the different strings are
> pasted together, and has to traverse, in linear time, increasing
> numbers of allocated SEXPs. So times scale approximately quadratically
> with the number of rows in the VCF.
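>
> A quick way to eyeball this guess (gcinfo() and gc.time() are
> standard R):
>
> x <- as.character(seq_len(1e6)); y <- sample(x)
> gcinfo(TRUE)                  # report each garbage collection
> invisible(paste(x, y, sep = ":"))
> gcinfo(FALSE)
> gc.time()                     # cumulative time spent in the collector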
>
> One efficiency gain is to reduce the number of SEXPs in play by
> writing out in chunks -- as each chunk is written, its SEXPs become
> available for collection and are re-used. Here's my toy example:
>
> time.R
> ======
> splitIndices <- function(nx, ncl)
> {
>     i <- seq_len(nx)
>     if (ncl == 0L)
>         list()
>     else if (ncl == 1L || nx == 1L)
>         list(i)
>     else {
>         fuzz <- min((nx - 1L) / 1000, 0.4 * nx / ncl)
>         breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L)
>         structure(split(i, cut(i, breaks, labels = FALSE)), names = NULL)
>     }
> }
>
> x <- as.character(seq_len(1e7)); y <- sample(x)
> if (!is.na(Sys.getenv("SPLIT", NA))) {
>     idx <- splitIndices(length(x), 20)
>     system.time(for (i in idx) paste(x[i], y[i], sep = ":"))
> } else {
>     system.time(paste(x, y, sep = ":"))
> }
>
>
> Running under R-devel with
>
>     $ SPLIT=TRUE R --no-save --quiet -f time.R
>
> the relevant time is
>
>    user  system elapsed
>  15.320   0.064  15.381
>
> versus, with $ R --no-save --quiet -f time.R, it is
>
>    user  system elapsed
>  95.360   0.164  95.511
>
> I think this is likely a good overall strategy when dealing with
> character data -- process in independent chunks of moderate (1M?)
> size, large enough to benefit from vectorization but not entailing
> allocation of large numbers of in-use SEXPs (as a consequence, also
> enabling parallel evaluation in modest memory).
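>
> Concretely, the output could be streamed chunk-wise through a
> standard connection; a sketch re-using splitIndices(), x and y from
> time.R above:
>
> con <- file("out.txt", "w")
> for (i in splitIndices(length(x), 10)) {
>     ## each chunk's strings become garbage once written, so the
>     ## collector can reclaim them before the next chunk is built
>     writeLines(paste(x[i], y[i], sep = ":"), con)
> }
> close(con)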
>
> Martin
>
> Sorry for the confusion.
> ~G
>
>
> On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker
> <beckerg4 at gene.com> wrote:
>
> Martin and Val,
>
> I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields)
> with profiling enabled. The results of summaryRprof for that run are
> attached, though for a variety of reasons they are pretty misleading.
>
> It took over an hour to write (3700+ seconds), so it's definitely a
> bottleneck when the data get very large, even if it isn't for smaller
> data.
>
> Michael and I both think the culprit is all the pasting and cbinding
> that is going on -- and, more to the point, that memory for an internal
> representation to be written out is allocated at all. Streaming across
> the object, looping over rows and writing directly to file (e.g. from
> C), should be blisteringly fast in comparison.
>
> ~G
>
>
> On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence
> <michafla at gene.com> wrote:
>
> Gabe is still testing/profiling, but we'll send something randomized
> along eventually.
>
>
> On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan
> <mtmorgan at fhcrc.org> wrote:
>
> I didn't see a reproducible (simulated, I guess) example in the
> original thread -- can we be explicit about what the problem is?
>
> Martin
>
>
> On 08/26/2014 10:47 AM, Michael Lawrence wrote:
>
> My understanding is that the heap optimization provided marginal
> gains, and that we need to think harder about how to optimize all of
> the string manipulation in writeVcf. We either need to reduce it, or
> reduce its overhead (i.e., the CHARSXP allocation). Gabe is doing more
> tests.
>
>
> On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain
> <vobencha at fhcrc.org> wrote:
>
> Hi Gabe,
>
> Martin responded, and so did Michael:
>
> https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html
>
> It sounded like Michael was OK with working with/around heap
> initialization.
>
> Michael, is that right, or should we still consider this on the table?
>
> Val
>
> On 08/26/2014 09:34 AM, Gabe Becker wrote:
>
> Val,
>
> Has there been any movement on this? This remains a substantial
> bottleneck for us when writing very large VCF files (e.g.
> variants+genotypes for whole-genome NGS samples).
>
> I was able to see a ~25% speedup with 4 cores, and an "optimal"
> speedup of ~2x with 10-12 cores, for a VCF with 500k rows, using a
> very naive parallelization strategy and no other changes. I suspect
> this could be improved on quite a bit, or possibly made irrelevant
> with judicious use of serial C code.
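>
> A sketch of what such a naive strategy might look like (illustrative
> only, not the actual code: writeVcf() is VariantAnnotation's, the
> chunking helpers are from parallel, 'vcf' is an in-memory VCF object,
> and re-assembling the chunk files -- stripping the repeated headers --
> is glossed over):
>
> library(parallel)
> library(VariantAnnotation)
>
> ## write row-chunks of 'vcf' to temporary files in parallel
> chunks <- splitIndices(nrow(vcf), 12)
> outs <- tempfile(fileext = rep(".vcf", length(chunks)))
> mcmapply(function(i, f) writeVcf(vcf[i, ], f),
>          chunks, outs, mc.cores = 12)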
>
> Did you and Martin make any plans regarding optimizing writeVcf?
>
> Best,
> ~G
>
>
> On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain
> <vobencha at fhcrc.org> wrote:
>
> Hi Michael,
>
> I'm interested in working on this. I'll discuss with Martin next week
> when we're both back in the office.
>
> Val
>
>
>
>
>
> On 08/05/14 07:46, Michael Lawrence wrote:
>
> Hi guys (Val, Martin, Herve):
>
> Anyone have an itch for optimization? The writeVcf function is
> currently a bottleneck in our WGS genotyping pipeline. For a typical
> 50 million row gVCF, it was taking 2.25 hours prior to yesterday's
> improvements (pasteCollapseRows), which brought it down to about 1
> hour -- still too long by my standards (> 0). It takes only 3 minutes
> to call the genotypes (and associated likelihoods etc.) from the
> variant calls (using 80 cores and 450 GB RAM on one node), so the
> output is an issue. Profiling suggests that the running time scales
> non-linearly with the number of rows.
>
> Digging a little deeper, it seems to be something with R's
> string/memory allocation. Below, pasting 1 million strings takes 6
> seconds, but 10 million strings takes over 2 minutes. It gets way
> worse with 50 million. I suspect it has something to do with R's
> string hash table.
>
> set.seed(1000)
> end <- sample(1e8, 1e6)
> system.time(paste0("END", "=", end))
>    user  system elapsed
>   6.396   0.028   6.420
>
> end <- sample(1e8, 1e7)
> system.time(paste0("END", "=", end))
>    user  system elapsed
> 134.714   0.352 134.978
>
> Indeed, even this takes a long time (in a fresh session):
>
> set.seed(1000)
> end <- sample(1e8, 1e6)
> end <- sample(1e8, 1e7)
> system.time(as.character(end))
>    user  system elapsed
>  57.224   0.156  57.366
>
> But running it a second time is faster (about what one would expect?):
>
> system.time(levels <- as.character(end))
>    user  system elapsed
>  23.582   0.021  23.589
>
> I did some simple profiling of R and found that resizing of the
> string hash table is not a significant component of the time. So
> maybe something to do with the R heap/gc? No time right now to go
> deeper, but I know Martin likes this sort of thing ;)
>
> Michael
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>
> --
> Computational Biologist
> Genentech Research