[Bioc-devel] writeVcf performance
Valerie Obenchain
vobencha at fhcrc.org
Mon Sep 8 21:00:37 CEST 2014
fyi Martin found a bug with the treatment of list data (ie, Number =
'.') in the header. Working on a fix ...
Val
On 09/08/2014 08:43 AM, Gabe Becker wrote:
> Val,
>
> That is great. I'll check this out and test it on our end.
>
> ~G
>
> On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain <vobencha at fhcrc.org
> <mailto:vobencha at fhcrc.org>> wrote:
>
> The new writeVcf code is in 1.11.28.
>
> Using the illumina file you suggested, geno fields only, writing now
> takes about 17 minutes.
>
> > hdr
> class: VCFHeader
> samples(1): NA12877
> meta(6): fileformat ApplyRecalibration ... reference source
> fixed(1): FILTER
> info(22): AC AF ... culprit set
> geno(8): GT GQX ... PL VF
>
> > param = ScanVcfParam(info=NA)
> > vcf = readVcf(fl, "", param=param)
> > dim(vcf)
> [1] 51612762 1
>
> > system.time(writeVcf(vcf, "out.vcf"))
> user system elapsed
> 971.032 6.568 1004.593
>
> In 1.11.28, parsing of geno data was moved to C. If this didn't
> speed things up enough we were planning to implement 'chunking'
> through the VCF and/or move the parsing of info to C, however, it
> looks like geno was the bottleneck.
>
> I've tested a number of samples/fields combinations in files with >=
> .5 million rows and the improvement over writeVcf() in release is ~ 90%.
>
> Valerie
>
>
>
>
> On 09/04/14 15:28, Valerie Obenchain wrote:
>
> Thanks Gabe. I should have something for you on Monday.
>
> Val
>
>
> On 09/04/2014 01:56 PM, Gabe Becker wrote:
>
> Val and Martin,
>
> Apologies for the delay.
>
> We realized that the Illumina platinum genome vcf files make
> a good test
> case, assuming you strip out all the info (info=NA when
> reading it into
> R) stuff.
>
> ftp://platgene:G3n3s4me@ussd-__ftp.illumina.com/NA12877_S1.__genome.vcf.gz
> <ftp://platgene:G3n3s4me@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz>
> took about ~4.2 hrs to write out, and is about 1.5x the size
> of the
> files we are actually dealing with (~50M ranges vs our ~30M).
>
> Looking forward a new vastly improved writeVcf :).
>
> ~G
>
>
> On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence
> <lawrence.michael at gene.com
> <mailto:lawrence.michael at gene.com>
> <mailto:lawrence.michael at gene.__com
> <mailto:lawrence.michael at gene.com>>> wrote:
>
> Yes, it's very clear that the scaling is non-linear,
> and Gabe has
> been experimenting with a chunk-wise + parallel algorithm.
> Unfortunately there is some frustrating overhead with the
> parallelism. But I'm glad Val is arriving at something
> quicker.
>
> Michael
>
>
> On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan
> <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
> <mailto:mtmorgan at fhcrc.org
> <mailto:mtmorgan at fhcrc.org>>> wrote:
>
> On 08/27/2014 11:56 AM, Gabe Becker wrote:
>
> The profiling I attached in my previous email
> is for 24 geno
> fields, as I said,
> but our typical usecase involves only ~4-6
> fields, and is
> faster but still on
> the order of dozens of minutes.
>
>
> I think Val is arriving at a (much) more efficient
> implementation, but...
>
> I wanted to share my guess that the poor _scaling_
> is because
> the garbage collector runs multiple times as the
> different
> strings are pasted together, and has to traverse,
> in linear
> time, increasing numbers of allocated SEXPs. So
> times scale
> approximately quadratically with the number of rows
> in the VCF
>
> An efficiency is to reduce the number of SEXPs in
> play by
> writing out in chunks -- as each chunk is written,
> the SEXPs
> become available for collection and are re-used.
> Here's my toy
> example
>
> time.R
> ======
> splitIndices <- function (nx, ncl)
> {
> i <- seq_len(nx)
> if (ncl == 0L)
> list()
> else if (ncl == 1L || nx == 1L)
> list(i)
> else {
> fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
> breaks <- seq(1 - fuzz, nx + fuzz, length
> = ncl + 1L)
> structure(split(i, cut(i, breaks,
> labels=FALSE)), names
> = NULL)
> }
> }
>
> x = as.character(seq_len(1e7)); y = sample(x)
> if (!is.na <http://is.na>
> <http://is.na>(Sys.getenv("__SPLIT", NA))) {
> idx <- splitIndices(length(x), 20)
> system.time(for (i in idx) paste(x[i], y[i],
> sep=":"))
> } else {
> system.time(paste(x, y, sep=":"))
> }
>
>
> running under R-devel with $ SPLIT=TRUE R --no-save
> --quiet -f
> time.R the relevant time is
>
> user system elapsed
> 15.320 0.064 15.381
>
> versus with $ R --no-save --quiet -f time.R it is
>
> user system elapsed
> 95.360 0.164 95.511
>
> I think this is likely an overall strategy when
> dealing with
> character data -- processing in independent chunks
> of moderate
> (1M?) size (enabling as a consequence parallel
> evaluation in
> modest memory) that are sufficient to benefit from
> vectorization, but that do not entail allocation of
> large
> numbers of in-use SEXPs.
>
> Martin
>
>
> Sorry for the confusion.
> ~G
>
>
> On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker
> <beckerg4 at gene.com <mailto:beckerg4 at gene.com>
> <mailto:beckerg4 at gene.com <mailto:beckerg4 at gene.com>>
> <mailto:beckerg4 at gene.com
> <mailto:beckerg4 at gene.com> <mailto:beckerg4 at gene.com
> <mailto:beckerg4 at gene.com>>>> wrote:
>
> Martin and Val.
>
> I re-ran writeVcf on our (G)VCF data
> (34790518 ranges,
> 24 geno fields) with
> profiling enabled. The results of
> summaryRprof for that
> run are attached,
> though for a variety of reasons they are
> pretty
> misleading.
>
> It took over an hour to write
> (3700+seconds), so it's
> definitely a
> bottleneck when the data get very large,
> even if it
> isn't for smaller data.
>
> Michael and I both think the culprit is
> all the pasting
> and cbinding that is
> going on, and more to the point, that
> memory for an
> internal representation
> to be written out is allocated at all.
> Streaming
> across the object, looping
> by rows and writing directly to file (e.g.
> from C)
> should be blisteringly
> fast in comparison.
>
> ~G
>
>
> On Tue, Aug 26, 2014 at 11:57 AM, Michael
> Lawrence
> <michafla at gene.com <mailto:michafla at gene.com>
> <mailto:michafla at gene.com <mailto:michafla at gene.com>>
> <mailto:michafla at gene.com
> <mailto:michafla at gene.com> <mailto:michafla at gene.com
> <mailto:michafla at gene.com>>>>
> wrote:
>
> Gabe is still testing/profiling, but
> we'll send
> something randomized
> along eventually.
>
>
> On Tue, Aug 26, 2014 at 11:15 AM,
> Martin Morgan
> <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
> <mailto:mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>>
> <mailto:mtmorgan at fhcrc.org
> <mailto:mtmorgan at fhcrc.org>
> <mailto:mtmorgan at fhcrc.org
> <mailto:mtmorgan at fhcrc.org>>>> wrote:
>
> I didn't see in the original thread a
> reproducible (simulated, I
> guess) example, to be explicit
> about what the
> problem is??
>
> Martin
>
>
> On 08/26/2014 10:47 AM, Michael
> Lawrence wrote:
>
> My understanding is that the heap
> optimization provided marginal
> gains, and
> that we need to think harder
> about how to
> optimize the all of
> the string
> manipulation in writeVcf. We
> either need to
> reduce it or reduce its
> overhead (i.e., the CHARSXP
> allocation).
> Gabe is doing more tests.
>
>
> On Tue, Aug 26, 2014 at 9:43
> AM, Valerie
> Obenchain
> <vobencha at fhcrc.org
> <mailto:vobencha at fhcrc.org>
> <mailto:vobencha at fhcrc.org
> <mailto:vobencha at fhcrc.org>> <mailto:vobencha at fhcrc.org
> <mailto:vobencha at fhcrc.org>
> <mailto:vobencha at fhcrc.org
> <mailto:vobencha at fhcrc.org>>>>
>
> wrote:
>
> Hi Gabe,
>
> Martin responded, and so
> did Michael,
>
>
> https://stat.ethz.ch/______pipermail/bioc-devel/2014-______August/006082.html
> <https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html>
>
> <https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html
> <https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html>>
>
>
>
> <https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html
> <https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html>
>
> <https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html
> <https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html>>>
>
> It sounded like Michael
> was ok with
> working with/around heap
> initialization.
>
> Michael, is that right or
> should we
> still consider this on
> the table?
>
>
> Val
>
>
> On 08/26/2014 09:34 AM,
> Gabe Becker
> wrote:
>
> Val,
>
> Has there been any
> movement on
> this? This remains a
> substantial
> bottleneck for us when
> writing very
> large VCF files (e.g.
> variants+genotypes for
> whole genome
> NGS samples).
>
> I was able to see a
> ~25% speedup
> with 4 cores and an
> "optimal" speedup
> of ~2x with 10-12
> cores for a VCF
> with 500k rows using
> a very naive
> parallelization
> strategy and no
> other changes. I suspect
> this could be
> improved on quite a
> bit, or
> possibly made irrelevant
> with judicious use
> of serial C code.
>
> Did you and Martin
> make any plans
> regarding optimizing
> writeVcf?
>
> Best
> ~G
>
>
> On Tue, Aug 5, 2014 at
> 2:33 PM,
> Valerie Obenchain
> <vobencha at fhcrc.org
> <mailto:vobencha at fhcrc.org>
> <mailto:vobencha at fhcrc.org
> <mailto:vobencha at fhcrc.org>> <mailto:vobencha at fhcrc.org
> <mailto:vobencha at fhcrc.org>
> <mailto:vobencha at fhcrc.org
> <mailto:vobencha at fhcrc.org>>>
>
> <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
> <mailto:vobencha at fhcrc.org
> <mailto:vobencha at fhcrc.org>> <mailto:vobencha at fhcrc.org
> <mailto:vobencha at fhcrc.org>
> <mailto:vobencha at fhcrc.org
> <mailto:vobencha at fhcrc.org>>>>>
>
> wrote:
>
> Hi Michael,
>
> I'm interested
> in working on
> this. I'll discuss
> with Martin next
> week when we're
> both back in
> the office.
>
> Val
>
>
>
>
>
> On 08/05/14
> 07:46, Michael
> Lawrence wrote:
>
> Hi guys
> (Val, Martin,
> Herve):
>
> Anyone have
> an itch for
> optimization? The
> writeVcf function is
> currently a
> bottleneck
> in our WGS
> genotyping pipeline. For
> a typical 50
> million row
> gVCF, it was
> taking 2.25
> hours prior to
> yesterday's improvements
>
> (pasteCollapseRows) that
> brought it down to
> about 1 hour, which
> is still
> too long by
> my standards
> (> 0). Only takes 3
> minutes to call the
> genotypes
> (and associated
> likelihoods etc) from the
> variant calls (using
> 80 cores and
> 450 GB RAM
> on one node),
> so the output is an
> issue. Profiling
> suggests that
> the running
> time scales
> non-linearly in the
> number of rows.
>
> Digging a
> little deeper,
> it seems to be
> something with R's
> string/memory
> allocation.
> Below,
> pasting 1 million strings
> takes 6 seconds, but
> 10
> million
> strings takes
> over 2 minutes. It gets
> way worse with 50
> million. I
> suspect it
> has something
> to do with R's string
> hash table.
>
> set.seed(1000)
> end <-
> sample(1e8, 1e6)
>
> system.time(paste0("END",
> "=", end))
> user
> system elapsed
> 6.396
> 0.028 6.420
>
> end <-
> sample(1e8, 1e7)
>
> system.time(paste0("END",
> "=", end))
> user
> system elapsed
> 134.714
> 0.352 134.978
>
> Indeed, even
> this takes a
> long time (in a
> fresh session):
>
> set.seed(1000)
> end <-
> sample(1e8, 1e6)
> end <-
> sample(1e8, 1e7)
>
> system.time(as.character(end))
> user
> system elapsed
> 57.224
> 0.156 57.366
>
> But running
> it a second
> time is faster (about
> what one would
> expect?):
>
>
> system.time(levels <-
> as.character(end))
> user
> system elapsed
> 23.582
> 0.021 23.589
>
> I did some
> simple
> profiling of R to find that
> the resizing of
> the string
> hash table
> is not a
> significant component of
> the time. So maybe
> something
> to do with
> the R heap/gc?
> No time right now to
> go deeper. But I
> know Martin
> likes this
> sort of
> thing ;)
>
> Michael
>
>
> [[alternative
> HTML version deleted]]
>
>
>
>
> _______________________________________________________
> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
> <mailto:Bioc-devel at r-project.__org
> <mailto:Bioc-devel at r-project.org>>
>
> <mailto:Bioc-devel at r-project.
> <mailto:Bioc-devel at r-project.>____org
> <mailto:Bioc-devel at r-project.__org
> <mailto:Bioc-devel at r-project.org>>>
>
> <mailto:Bioc-devel at r-project <mailto:Bioc-devel at r-project>.
> <mailto:Bioc-devel at r-project
> <mailto:Bioc-devel at r-project>.>______org
>
> <mailto:Bioc-devel at r-project.
> <mailto:Bioc-devel at r-project.>____org
> <mailto:Bioc-devel at r-project.__org
> <mailto:Bioc-devel at r-project.org>>>>
> mailing list
>
> https://stat.ethz.ch/mailman/________listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>
>
> <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>
>
>
> <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>
>
> <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>>
>
>
>
> <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>
>
> <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>
>
>
> <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
>
> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>>
>
>
>
>
> _______________________________________________________
> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
> <mailto:Bioc-devel at r-project.__org
> <mailto:Bioc-devel at r-project.org>>
>
> <mailto:Bioc-devel at r-project.
> <mailto:Bioc-devel at r-project.>____org
> <mailto:Bioc-devel at r-project.__org
> <mailto:Bioc-devel at r-project.org>>>
>
> <mailto:Bioc-devel at r-project <mailto:Bioc-devel at r-project>.
> <mailto:Bioc-devel at r-project
> <mailto:Bioc-devel at r-project>.>______org
>
> <mailto:Bioc-devel at r-project.
> <mailto:Bioc-devel at r-project.>____org
> <mailto:Bioc-devel at r-project.__org
> <mailto:Bioc-devel at r-project.org>>>> mailing
> list
> https://stat.ethz.ch/mailman/________listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>
>
> <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>
>
>
> <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>
>
> <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>>
>
>
>
>
>
> <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>
>
> <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>
>
>
> <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
>
> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>>
>
>
>
>
> --
> Computational Biologist
> Genentech Research
>
>
>
>
>
> [[alternative HTML
> version
> deleted]]
>
>
>
> _____________________________________________________
> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
> <mailto:Bioc-devel at r-project.__org
> <mailto:Bioc-devel at r-project.org>>
> <mailto:Bioc-devel at r-project.
> <mailto:Bioc-devel at r-project.>____org
> <mailto:Bioc-devel at r-project.__org
> <mailto:Bioc-devel at r-project.org>>>
> mailing list
> https://stat.ethz.ch/mailman/______listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>
>
> <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>
>
>
> <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
>
> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>
>
>
>
> --
> Computational Biology / Fred
> Hutchinson Cancer
> Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
> <tel:%28206%29%20667-2793>
> <tel:%28206%29%20667-2793>
> <tel:%28206%29%20667-2793>
>
>
>
>
>
>
> --
> Computational Biologist
> Genentech Research
>
>
>
>
> --
> Computational Biologist
> Genentech Research
>
>
>
> --
> Computational Biology / Fred Hutchinson Cancer
> Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793 <tel:%28206%29%20667-2793>
> <tel:%28206%29%20667-2793>
>
>
>
>
>
> --
> Computational Biologist
> Genentech Research
>
>
> _________________________________________________
> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
> mailing list
> https://stat.ethz.ch/mailman/__listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>
>
>
>
>
> --
> Computational Biologist
> Genentech Research
More information about the Bioc-devel
mailing list