[Bioc-devel] writeVcf performance
Kasper Daniel Hansen
kasperdanielhansen at gmail.com
Fri Sep 5 15:44:55 CEST 2014
This approach, writing in chunks, is the same one Herve and I used for writing
FASTA in the Biostrings package, although I see that Herve has now replaced
the R implementation with a C implementation. I similarly found an
absolutely huge speed-up when writing genomes, by chunking.
Best,
Kasper
On Tue, Sep 2, 2014 at 4:33 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> On 08/27/2014 11:56 AM, Gabe Becker wrote:
>
>> The profiling I attached in my previous email is for 24 geno fields, as I
>> said, but our typical use case involves only ~4-6 fields, and is faster
>> but still on the order of dozens of minutes.
>>
>
> I think Val is arriving at a (much) more efficient implementation, but...
>
> I wanted to share my guess that the poor _scaling_ is because the garbage
> collector runs multiple times as the different strings are pasted together,
> and has to traverse, in linear time, increasing numbers of allocated SEXPs.
> So times scale approximately quadratically with the number of rows in the
> VCF.
>
> One efficiency gain is to reduce the number of SEXPs in play by writing out
> in chunks -- as each chunk is written, the SEXPs become available for
> collection and are re-used. Here's my toy example:
>
> time.R
> ======
> splitIndices <- function(nx, ncl)
> {
>     i <- seq_len(nx)
>     if (ncl == 0L)
>         list()
>     else if (ncl == 1L || nx == 1L)
>         list(i)
>     else {
>         fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
>         breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L)
>         structure(split(i, cut(i, breaks, labels = FALSE)), names = NULL)
>     }
> }
>
> x <- as.character(seq_len(1e7)); y <- sample(x)
> if (!is.na(Sys.getenv("SPLIT", NA))) {
>     idx <- splitIndices(length(x), 20)
>     system.time(for (i in idx) paste(x[i], y[i], sep = ":"))
> } else {
>     system.time(paste(x, y, sep = ":"))
> }
>
>
> Running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f time.R, the
> relevant time is
>
> user system elapsed
> 15.320 0.064 15.381
>
> versus with $ R --no-save --quiet -f time.R it is
>
> user system elapsed
> 95.360 0.164 95.511
>
> I think this is likely a good overall strategy when dealing with character
> data -- processing in independent chunks of moderate (1M?) size (enabling,
> as a consequence, parallel evaluation in modest memory) that are sufficient
> to benefit from vectorization, but that do not entail allocation of large
> numbers of in-use SEXPs.
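A minimal sketch of how the chunked strategy above extends to parallel evaluation (illustrative only, not code from this thread; `pasteChunked` is a hypothetical helper):

```r
## Illustrative sketch (hypothetical helper): paste two character
## vectors chunk by chunk, optionally in parallel. Intermediate SEXPs
## from each chunk become garbage-collectable before the next chunk is
## processed, keeping the number of live SEXPs small.
library(parallel)

pasteChunked <- function(x, y, nchunks = 20L, cores = 1L) {
    idx <- split(seq_along(x), cut(seq_along(x), nchunks, labels = FALSE))
    chunks <- mclapply(idx, function(i) paste(x[i], y[i], sep = ":"),
                       mc.cores = cores)  # mc.cores > 1 is Unix-only
    unlist(chunks, use.names = FALSE)
}
```

With `cores = 1L` this reproduces the serial chunked timing above; the chunking alone, not the parallelism, is what avoids the quadratic GC behavior.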
>
> Martin
>
>
>> Sorry for the confusion.
>> ~G
>>
>>
>> On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker <beckerg4 at gene.com> wrote:
>>
>> Martin and Val.
>>
>> I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields)
>> with profiling enabled. The results of summaryRprof for that run are
>> attached, though for a variety of reasons they are pretty misleading.
>>
>> It took over an hour to write (3700+ seconds), so it's definitely a
>> bottleneck when the data get very large, even if it isn't for smaller
>> data.
>>
>> Michael and I both think the culprit is all the pasting and cbinding that
>> is going on -- and, more to the point, that memory for an internal
>> representation to be written out is allocated at all. Streaming across
>> the object, looping over rows and writing directly to file (e.g. from C)
>> should be blisteringly fast in comparison.
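A rough sketch of that streaming idea (illustrative only; `writeChunked` and `makeLines` are hypothetical names, not the proposed implementation): format and write rows in chunks through an open connection, so the full character representation never exists in memory at once.

```r
## Hypothetical sketch: stream formatted lines to a file in chunks,
## so at most one chunk's worth of strings is alive at any time.
## makeLines(i) is assumed to format rows i of the object as character lines.
writeChunked <- function(makeLines, nrows, path, chunkSize = 1e6L) {
    con <- file(path, open = "wt")
    on.exit(close(con))
    for (start in seq(1L, nrows, by = chunkSize)) {
        end <- min(start + chunkSize - 1L, nrows)
        writeLines(makeLines(start:end), con)
    }
}
```

A C implementation could go further by formatting each row directly into an output buffer, but even this R-level version avoids holding one giant character vector for the whole file.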
>>
>> ~G
>>
>>
>> On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence <michafla at gene.com> wrote:
>>
>> Gabe is still testing/profiling, but we'll send something randomized
>> along eventually.
>>
>>
>> On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>>
>> I didn't see in the original thread a reproducible (simulated, I guess)
>> example, to be explicit about what the problem is?
>>
>> Martin
>>
>>
>> On 08/26/2014 10:47 AM, Michael Lawrence wrote:
>>
>> My understanding is that the heap optimization provided marginal gains,
>> and that we need to think harder about how to optimize all of the string
>> manipulation in writeVcf. We either need to reduce it or reduce its
>> overhead (i.e., the CHARSXP allocation). Gabe is doing more tests.
>>
>>
>> On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain <vobencha at fhcrc.org> wrote:
>>
>> Hi Gabe,
>>
>> Martin responded, and so did Michael,
>>
>> https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html
>>
>> It sounded like Michael was ok with working with/around heap
>> initialization.
>>
>> Michael, is that right, or should we still consider this on the table?
>>
>>
>> Val
>>
>>
>> On 08/26/2014 09:34 AM, Gabe Becker wrote:
>>
>> Val,
>>
>> Has there been any movement on this? This remains a substantial
>> bottleneck for us when writing very large VCF files (e.g.
>> variants+genotypes for whole genome NGS samples).
>>
>> I was able to see a ~25% speedup with 4 cores, and an "optimal" speedup
>> of ~2x with 10-12 cores, for a VCF with 500k rows, using a very naive
>> parallelization strategy and no other changes. I suspect this could be
>> improved on quite a bit, or possibly made irrelevant with judicious use
>> of serial C code.
>>
>> Did you and Martin make any plans regarding optimizing writeVcf?
>>
>> Best
>> ~G
>>
>>
>> On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain <vobencha at fhcrc.org> wrote:
>>
>> Hi Michael,
>>
>> I'm interested in working on this. I'll discuss with Martin next week
>> when we're both back in the office.
>>
>> Val
>>
>>
>> On 08/05/14 07:46, Michael Lawrence wrote:
>>
>> Hi guys (Val, Martin, Herve):
>>
>> Anyone have an itch for optimization? The writeVcf function is currently
>> a bottleneck in our WGS genotyping pipeline. For a typical 50 million row
>> gVCF, it was taking 2.25 hours prior to yesterday's improvements
>> (pasteCollapseRows) that brought it down to about 1 hour, which is still
>> too long by my standards (> 0). It only takes 3 minutes to call the
>> genotypes (and associated likelihoods etc.) from the variant calls (using
>> 80 cores and 450 GB RAM on one node), so the output is an issue.
>> Profiling suggests that the running time scales non-linearly in the
>> number of rows.
>>
>> Digging a little deeper, it seems to be something with R's string/memory
>> allocation. Below, pasting 1 million strings takes 6 seconds, but 10
>> million strings takes over 2 minutes. It gets way worse with 50 million.
>> I suspect it has something to do with R's string hash table.
>>
>> set.seed(1000)
>> end <- sample(1e8, 1e6)
>> system.time(paste0("END", "=", end))
>> user system elapsed
>> 6.396 0.028 6.420
>>
>> end <- sample(1e8, 1e7)
>> system.time(paste0("END", "=", end))
>> user system elapsed
>> 134.714 0.352 134.978
>>
>> Indeed, even this takes a long time (in a fresh session):
>>
>> set.seed(1000)
>> end <- sample(1e8, 1e6)
>> end <- sample(1e8, 1e7)
>> system.time(as.character(end))
>> user system elapsed
>> 57.224 0.156 57.366
>>
>> But running it a second time is faster (about what one would expect?):
>>
>> system.time(levels <- as.character(end))
>> user system elapsed
>> 23.582 0.021 23.589
>>
>> I did some simple profiling of R to find that the resizing of the string
>> hash table is not a significant component of the time. So maybe something
>> to do with the R heap/gc? No time right now to go deeper. But I know
>> Martin likes this sort of thing ;)
>>
>> Michael
>>
>> --
>> Computational Biologist
>> Genentech Research
>>
>>
>>
>>
>>
>>
>>
>>
>> --
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793
>>
>>
>>
>>
>>
>>
>
>
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>