[Bioc-devel] writeVcf performance
Valerie Obenchain
vobencha at fhcrc.org
Tue Sep 9 23:12:49 CEST 2014
Writing 'list' data has been fixed in 1.11.30. fyi, Herve is in the
process of moving SimpleList and DataFrame from IRanges to S4Vectors;
finished up today I think. Anyhow, if you get VariantAnnotation from svn
you'll need to update S4Vectors, IRanges and GenomicRanges (and maybe
rtracklayer).
I'm working on the 'chunking' approach next. It looks like we can still
gain from adding that.
Valerie
On 09/08/2014 12:00 PM, Valerie Obenchain wrote:
> fyi Martin found a bug with the treatment of list data (ie, Number =
> '.') in the header. Working on a fix ...
>
> Val
>
>
> On 09/08/2014 08:43 AM, Gabe Becker wrote:
>> Val,
>>
>> That is great. I'll check this out and test it on our end.
>>
>> ~G
>>
>> On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain <vobencha at fhcrc.org
>> <mailto:vobencha at fhcrc.org>> wrote:
>>
>> The new writeVcf code is in 1.11.28.
>>
>> Using the illumina file you suggested, geno fields only, writing now
>> takes about 17 minutes.
>>
>> > hdr
>> class: VCFHeader
>> samples(1): NA12877
>> meta(6): fileformat ApplyRecalibration ... reference source
>> fixed(1): FILTER
>> info(22): AC AF ... culprit set
>> geno(8): GT GQX ... PL VF
>>
>> > param = ScanVcfParam(info=NA)
>> > vcf = readVcf(fl, "", param=param)
>> > dim(vcf)
>> [1] 51612762 1
>>
>> > system.time(writeVcf(vcf, "out.vcf"))
>> user system elapsed
>> 971.032 6.568 1004.593
>>
>> In 1.11.28, parsing of geno data was moved to C. If this didn't
>> speed things up enough we were planning to implement 'chunking'
>> through the VCF and/or move the parsing of info to C, however, it
>> looks like geno was the bottleneck.
>>
>> I've tested a number of samples/fields combinations in files with >=
>> .5 million rows and the improvement over writeVcf() in release is
>> ~ 90%.
>>
>> Valerie
>>
>>
>>
>>
>> On 09/04/14 15:28, Valerie Obenchain wrote:
>>
>> Thanks Gabe. I should have something for you on Monday.
>>
>> Val
>>
>>
>> On 09/04/2014 01:56 PM, Gabe Becker wrote:
>>
>> Val and Martin,
>>
>> Apologies for the delay.
>>
>> We realized that the Illumina platinum genome vcf files make
>> a good test
>> case, assuming you strip out all the info (info=NA when
>> reading it into
>> R) stuff.
>>
>>
>> ftp://platgene:G3n3s4me@ussd-__ftp.illumina.com/NA12877_S1.__genome.vcf.gz
>>
>>
>> <ftp://platgene:G3n3s4me@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz>
>> took about ~4.2 hrs to write out, and is about 1.5x the size
>> of the
>> files we are actually dealing with (~50M ranges vs our ~30M).
>>
>> Looking forward a new vastly improved writeVcf :).
>>
>> ~G
>>
>>
>> On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence
>> <lawrence.michael at gene.com
>> <mailto:lawrence.michael at gene.com>
>> <mailto:lawrence.michael at gene.__com
>> <mailto:lawrence.michael at gene.com>>> wrote:
>>
>> Yes, it's very clear that the scaling is non-linear,
>> and Gabe has
>> been experimenting with a chunk-wise + parallel
>> algorithm.
>> Unfortunately there is some frustrating overhead with
>> the
>> parallelism. But I'm glad Val is arriving at something
>> quicker.
>>
>> Michael
>>
>>
>> On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan
>> <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
>> <mailto:mtmorgan at fhcrc.org
>> <mailto:mtmorgan at fhcrc.org>>> wrote:
>>
>> On 08/27/2014 11:56 AM, Gabe Becker wrote:
>>
>> The profiling I attached in my previous email
>> is for 24 geno
>> fields, as I said,
>> but our typical usecase involves only ~4-6
>> fields, and is
>> faster but still on
>> the order of dozens of minutes.
>>
>>
>> I think Val is arriving at a (much) more efficient
>> implementation, but...
>>
>> I wanted to share my guess that the poor _scaling_
>> is because
>> the garbage collector runs multiple times as the
>> different
>> strings are pasted together, and has to traverse,
>> in linear
>> time, increasing numbers of allocated SEXPs. So
>> times scale
>> approximately quadratically with the number of rows
>> in the VCF
>>
>> An efficiency is to reduce the number of SEXPs in
>> play by
>> writing out in chunks -- as each chunk is written,
>> the SEXPs
>> become available for collection and are re-used.
>> Here's my toy
>> example
>>
>> time.R
>> ======
>> splitIndices <- function (nx, ncl)
>> {
>> i <- seq_len(nx)
>> if (ncl == 0L)
>> list()
>> else if (ncl == 1L || nx == 1L)
>> list(i)
>> else {
>> fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
>> breaks <- seq(1 - fuzz, nx + fuzz, length
>> = ncl + 1L)
>> structure(split(i, cut(i, breaks,
>> labels=FALSE)), names
>> = NULL)
>> }
>> }
>>
>> x = as.character(seq_len(1e7)); y = sample(x)
>> if (!is.na <http://is.na>
>> <http://is.na>(Sys.getenv("__SPLIT", NA))) {
>> idx <- splitIndices(length(x), 20)
>> system.time(for (i in idx) paste(x[i], y[i],
>> sep=":"))
>> } else {
>> system.time(paste(x, y, sep=":"))
>> }
>>
>>
>> running under R-devel with $ SPLIT=TRUE R --no-save
>> --quiet -f
>> time.R the relevant time is
>>
>> user system elapsed
>> 15.320 0.064 15.381
>>
>> versus with $ R --no-save --quiet -f time.R it is
>>
>> user system elapsed
>> 95.360 0.164 95.511
>>
>> I think this is likely an overall strategy when
>> dealing with
>> character data -- processing in independent chunks
>> of moderate
>> (1M?) size (enabling as a consequence parallel
>> evaluation in
>> modest memory) that are sufficient to benefit from
>> vectorization, but that do not entail allocation of
>> large
>> numbers of in-use SEXPs.
>>
>> Martin
>>
>>
>> Sorry for the confusion.
>> ~G
>>
>>
>> On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker
>> <beckerg4 at gene.com <mailto:beckerg4 at gene.com>
>> <mailto:beckerg4 at gene.com <mailto:beckerg4 at gene.com>>
>> <mailto:beckerg4 at gene.com
>> <mailto:beckerg4 at gene.com> <mailto:beckerg4 at gene.com
>> <mailto:beckerg4 at gene.com>>>> wrote:
>>
>> Martin and Val.
>>
>> I re-ran writeVcf on our (G)VCF data
>> (34790518 ranges,
>> 24 geno fields) with
>> profiling enabled. The results of
>> summaryRprof for that
>> run are attached,
>> though for a variety of reasons they are
>> pretty
>> misleading.
>>
>> It took over an hour to write
>> (3700+seconds), so it's
>> definitely a
>> bottleneck when the data get very large,
>> even if it
>> isn't for smaller data.
>>
>> Michael and I both think the culprit is
>> all the pasting
>> and cbinding that is
>> going on, and more to the point, that
>> memory for an
>> internal representation
>> to be written out is allocated at all.
>> Streaming
>> across the object, looping
>> by rows and writing directly to file (e.g.
>> from C)
>> should be blisteringly
>> fast in comparison.
>>
>> ~G
>>
>>
>> On Tue, Aug 26, 2014 at 11:57 AM, Michael
>> Lawrence
>> <michafla at gene.com <mailto:michafla at gene.com>
>> <mailto:michafla at gene.com <mailto:michafla at gene.com>>
>> <mailto:michafla at gene.com
>> <mailto:michafla at gene.com> <mailto:michafla at gene.com
>> <mailto:michafla at gene.com>>>>
>> wrote:
>>
>> Gabe is still testing/profiling, but
>> we'll send
>> something randomized
>> along eventually.
>>
>>
>> On Tue, Aug 26, 2014 at 11:15 AM,
>> Martin Morgan
>> <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
>> <mailto:mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>>
>> <mailto:mtmorgan at fhcrc.org
>> <mailto:mtmorgan at fhcrc.org>
>> <mailto:mtmorgan at fhcrc.org
>> <mailto:mtmorgan at fhcrc.org>>>> wrote:
>>
>> I didn't see in the original
>> thread a
>> reproducible (simulated, I
>> guess) example, to be explicit
>> about what the
>> problem is??
>>
>> Martin
>>
>>
>> On 08/26/2014 10:47 AM, Michael
>> Lawrence wrote:
>>
>> My understanding is that the
>> heap
>> optimization provided marginal
>> gains, and
>> that we need to think harder
>> about how to
>> optimize the all of
>> the string
>> manipulation in writeVcf. We
>> either need to
>> reduce it or reduce its
>> overhead (i.e., the CHARSXP
>> allocation).
>> Gabe is doing more tests.
>>
>>
>> On Tue, Aug 26, 2014 at 9:43
>> AM, Valerie
>> Obenchain
>> <vobencha at fhcrc.org
>> <mailto:vobencha at fhcrc.org>
>> <mailto:vobencha at fhcrc.org
>> <mailto:vobencha at fhcrc.org>> <mailto:vobencha at fhcrc.org
>> <mailto:vobencha at fhcrc.org>
>> <mailto:vobencha at fhcrc.org
>> <mailto:vobencha at fhcrc.org>>>>
>>
>> wrote:
>>
>> Hi Gabe,
>>
>> Martin responded, and so
>> did Michael,
>>
>>
>>
>> https://stat.ethz.ch/______pipermail/bioc-devel/2014-______August/006082.html
>>
>>
>> <https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html>
>>
>>
>>
>> <https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html
>>
>>
>> <https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html>>
>>
>>
>>
>>
>> <https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html
>>
>>
>> <https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html>
>>
>>
>> <https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html
>>
>> <https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html>>>
>>
>> It sounded like Michael
>> was ok with
>> working with/around heap
>> initialization.
>>
>> Michael, is that right or
>> should we
>> still consider this on
>> the table?
>>
>>
>> Val
>>
>>
>> On 08/26/2014 09:34 AM,
>> Gabe Becker
>> wrote:
>>
>> Val,
>>
>> Has there been any
>> movement on
>> this? This remains a
>> substantial
>> bottleneck for us when
>> writing very
>> large VCF files (e.g.
>> variants+genotypes for
>> whole genome
>> NGS samples).
>>
>> I was able to see a
>> ~25% speedup
>> with 4 cores and an
>> "optimal" speedup
>> of ~2x with 10-12
>> cores for a VCF
>> with 500k rows using
>> a very naive
>> parallelization
>> strategy and no
>> other changes. I suspect
>> this could be
>> improved on quite a
>> bit, or
>> possibly made irrelevant
>> with judicious use
>> of serial C code.
>>
>> Did you and Martin
>> make any plans
>> regarding optimizing
>> writeVcf?
>>
>> Best
>> ~G
>>
>>
>> On Tue, Aug 5, 2014 at
>> 2:33 PM,
>> Valerie Obenchain
>> <vobencha at fhcrc.org
>> <mailto:vobencha at fhcrc.org>
>> <mailto:vobencha at fhcrc.org
>> <mailto:vobencha at fhcrc.org>> <mailto:vobencha at fhcrc.org
>> <mailto:vobencha at fhcrc.org>
>> <mailto:vobencha at fhcrc.org
>> <mailto:vobencha at fhcrc.org>>>
>>
>> <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
>> <mailto:vobencha at fhcrc.org
>> <mailto:vobencha at fhcrc.org>> <mailto:vobencha at fhcrc.org
>> <mailto:vobencha at fhcrc.org>
>> <mailto:vobencha at fhcrc.org
>> <mailto:vobencha at fhcrc.org>>>>>
>>
>> wrote:
>>
>> Hi Michael,
>>
>> I'm interested
>> in working on
>> this. I'll discuss
>> with Martin next
>> week when we're
>> both back in
>> the office.
>>
>> Val
>>
>>
>>
>>
>>
>> On 08/05/14
>> 07:46, Michael
>> Lawrence wrote:
>>
>> Hi guys
>> (Val, Martin,
>> Herve):
>>
>> Anyone have
>> an itch for
>> optimization? The
>> writeVcf function is
>> currently a
>> bottleneck
>> in our WGS
>> genotyping pipeline. For
>> a typical 50
>> million row
>> gVCF, it was
>> taking 2.25
>> hours prior to
>> yesterday's
>> improvements
>>
>> (pasteCollapseRows) that
>> brought it down to
>> about 1 hour, which
>> is still
>> too long by
>> my standards
>> (> 0). Only takes 3
>> minutes to call the
>> genotypes
>> (and
>> associated
>> likelihoods etc) from the
>> variant calls (using
>> 80 cores and
>> 450 GB RAM
>> on one node),
>> so the output is an
>> issue. Profiling
>> suggests that
>> the running
>> time scales
>> non-linearly in the
>> number of rows.
>>
>> Digging a
>> little deeper,
>> it seems to be
>> something with R's
>> string/memory
>> allocation.
>> Below,
>> pasting 1 million strings
>> takes 6 seconds, but
>> 10
>> million
>> strings takes
>> over 2 minutes. It gets
>> way worse with 50
>> million. I
>> suspect it
>> has something
>> to do with R's string
>> hash table.
>>
>>
>> set.seed(1000)
>> end <-
>> sample(1e8, 1e6)
>>
>> system.time(paste0("END",
>> "=", end))
>> user
>> system elapsed
>> 6.396
>> 0.028 6.420
>>
>> end <-
>> sample(1e8, 1e7)
>>
>> system.time(paste0("END",
>> "=", end))
>> user
>> system elapsed
>> 134.714
>> 0.352 134.978
>>
>> Indeed, even
>> this takes a
>> long time (in a
>> fresh session):
>>
>>
>> set.seed(1000)
>> end <-
>> sample(1e8, 1e6)
>> end <-
>> sample(1e8, 1e7)
>>
>> system.time(as.character(end))
>> user
>> system elapsed
>> 57.224
>> 0.156 57.366
>>
>> But running
>> it a second
>> time is faster (about
>> what one would
>> expect?):
>>
>>
>> system.time(levels <-
>> as.character(end))
>> user
>> system elapsed
>> 23.582
>> 0.021 23.589
>>
>> I did some
>> simple
>> profiling of R to find that
>> the resizing of
>> the string
>> hash table
>> is not a
>> significant component of
>> the time. So maybe
>> something
>> to do with
>> the R heap/gc?
>> No time right now to
>> go deeper. But I
>> know Martin
>> likes this
>> sort of
>> thing ;)
>>
>> Michael
>>
>>
>> [[alternative
>> HTML version deleted]]
>>
>>
>>
>>
>> _______________________________________________________
>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>> <mailto:Bioc-devel at r-project.__org
>> <mailto:Bioc-devel at r-project.org>>
>>
>> <mailto:Bioc-devel at r-project.
>> <mailto:Bioc-devel at r-project.>____org
>> <mailto:Bioc-devel at r-project.__org
>> <mailto:Bioc-devel at r-project.org>>>
>>
>> <mailto:Bioc-devel at r-project <mailto:Bioc-devel at r-project>.
>> <mailto:Bioc-devel at r-project
>> <mailto:Bioc-devel at r-project>.>______org
>>
>> <mailto:Bioc-devel at r-project.
>> <mailto:Bioc-devel at r-project.>____org
>> <mailto:Bioc-devel at r-project.__org
>> <mailto:Bioc-devel at r-project.org>>>>
>> mailing list
>>
>> https://stat.ethz.ch/mailman/________listinfo/bioc-devel
>> <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>
>>
>> <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
>> <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>
>>
>>
>> <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
>> <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>
>>
>> <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
>> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>>
>>
>>
>>
>> <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
>> <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>
>>
>> <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
>> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>
>>
>>
>> <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
>> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
>>
>> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>>
>>
>>
>>
>>
>> _______________________________________________________
>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>> <mailto:Bioc-devel at r-project.__org
>> <mailto:Bioc-devel at r-project.org>>
>>
>> <mailto:Bioc-devel at r-project.
>> <mailto:Bioc-devel at r-project.>____org
>> <mailto:Bioc-devel at r-project.__org
>> <mailto:Bioc-devel at r-project.org>>>
>>
>> <mailto:Bioc-devel at r-project <mailto:Bioc-devel at r-project>.
>> <mailto:Bioc-devel at r-project
>> <mailto:Bioc-devel at r-project>.>______org
>>
>> <mailto:Bioc-devel at r-project.
>> <mailto:Bioc-devel at r-project.>____org
>> <mailto:Bioc-devel at r-project.__org
>> <mailto:Bioc-devel at r-project.org>>>> mailing
>> list
>> https://stat.ethz.ch/mailman/________listinfo/bioc-devel
>> <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>
>>
>> <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
>> <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>
>>
>>
>> <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
>> <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>
>>
>> <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
>> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>>
>>
>>
>>
>>
>>
>> <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
>> <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>
>>
>> <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
>> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>
>>
>>
>> <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
>> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
>>
>> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>>
>>
>>
>>
>>
>> --
>> Computational Biologist
>> Genentech Research
>>
>>
>>
>>
>>
>> [[alternative HTML
>> version
>> deleted]]
>>
>>
>>
>> _____________________________________________________
>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>> <mailto:Bioc-devel at r-project.__org
>> <mailto:Bioc-devel at r-project.org>>
>> <mailto:Bioc-devel at r-project.
>> <mailto:Bioc-devel at r-project.>____org
>> <mailto:Bioc-devel at r-project.__org
>> <mailto:Bioc-devel at r-project.org>>>
>> mailing list
>> https://stat.ethz.ch/mailman/______listinfo/bioc-devel
>> <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>
>>
>> <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
>> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>
>>
>>
>> <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
>> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
>>
>> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>
>>
>>
>>
>> --
>> Computational Biology / Fred
>> Hutchinson Cancer
>> Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793
>> <tel:%28206%29%20667-2793>
>> <tel:%28206%29%20667-2793>
>> <tel:%28206%29%20667-2793>
>>
>>
>>
>>
>>
>>
>> --
>> Computational Biologist
>> Genentech Research
>>
>>
>>
>>
>> --
>> Computational Biologist
>> Genentech Research
>>
>>
>>
>> --
>> Computational Biology / Fred Hutchinson Cancer
>> Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793 <tel:%28206%29%20667-2793>
>> <tel:%28206%29%20667-2793>
>>
>>
>>
>>
>>
>> --
>> Computational Biologist
>> Genentech Research
>>
>>
>> _________________________________________________
>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>> mailing list
>> https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>
>>
>>
>>
>>
>> --
>> Computational Biologist
>> Genentech Research
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
More information about the Bioc-devel
mailing list