[Bioc-devel] writeVcf performance

Kasper Daniel Hansen kasperdanielhansen at gmail.com
Fri Sep 5 15:44:55 CEST 2014


This approach, writing in chunks, is the same one Herve and I used for
writing FASTA in the Biostrings package, although I see that Herve has now
replaced the R implementation with a C implementation. I similarly found an
absolutely huge speed-up when writing genomes, by chunking.

Best,
Kasper


On Tue, Sep 2, 2014 at 4:33 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:

> On 08/27/2014 11:56 AM, Gabe Becker wrote:
>
>> The profiling I attached in my previous email is for 24 geno fields, as I
>> said, but our typical use case involves only ~4-6 fields, and is faster,
>> but still on the order of dozens of minutes.
>>
>
> I think Val is arriving at a (much) more efficient implementation, but...
>
> I wanted to share my guess that the poor _scaling_ is because the garbage
> collector runs multiple times as the different strings are pasted together,
> and has to traverse, in linear time, increasing numbers of allocated SEXPs.
> So times scale approximately quadratically with the number of rows in the
> VCF.
>
> One gain in efficiency is to reduce the number of SEXPs in play by writing
> out in chunks -- as each chunk is written, the SEXPs become available for
> collection and are re-used. Here's my toy example:
>
> time.R
> ======
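> ## partition seq_len(nx) into ncl roughly equal-sized index chunks
> ## (the same helper as parallel::splitIndices)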
> splitIndices <- function (nx, ncl)
> {
>     i <- seq_len(nx)
>     if (ncl == 0L)
>         list()
>     else if (ncl == 1L || nx == 1L)
>         list(i)
>     else {
>         fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
>         breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L)
>         structure(split(i, cut(i, breaks, labels=FALSE)), names = NULL)
>     }
> }
>
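> ## paste 10M string pairs: all at once, or in 20 chunks when the SPLIT
> ## environment variable is set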
> x = as.character(seq_len(1e7)); y = sample(x)
> if (!is.na(Sys.getenv("SPLIT", NA))) {
>     idx <- splitIndices(length(x), 20)
>     system.time(for (i in idx) paste(x[i], y[i], sep=":"))
> } else {
>     system.time(paste(x, y, sep=":"))
> }
>
>
> Running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f time.R, the
> relevant time is
>
>    user  system elapsed
>  15.320   0.064  15.381
>
> versus with $ R --no-save --quiet -f time.R, it is
>
>    user  system elapsed
>  95.360   0.164  95.511
>
> I think this is likely a good overall strategy when dealing with character
> data -- process in independent chunks of moderate (1M?) size, large enough
> to benefit from vectorization but not so large as to entail allocation of
> large numbers of in-use SEXPs (and, as a consequence, enabling parallel
> evaluation in modest memory).
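>
> A minimal sketch of that pattern applied to output (illustrative only;
> the file name and the paste step stand in for real formatting):
>
> con <- file("out.txt", "w")
> for (i in splitIndices(length(x), 20)) {
>     chunk <- paste(x[i], y[i], sep=":")  # moderate-sized allocation
>     writeLines(chunk, con)   # chunk's SEXPs are then free for re-use
> }
> close(con)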
>
> Martin
>
>
>> Sorry for the confusion.
>> ~G
>>
>>
>> On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker <beckerg4 at gene.com> wrote:
>>
>>     Martin and Val.
>>
>>     I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields)
>>     with profiling enabled. The results of summaryRprof for that run are
>>     attached, though for a variety of reasons they are pretty misleading.
>>
>>     It took over an hour to write (3700+ seconds), so it's definitely a
>>     bottleneck when the data get very large, even if it isn't for smaller
>>     data.
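>>
>>     For reference, the profiling was along these lines (a sketch only;
>>     'vcf' stands in for our VCF object):
>>
>>     Rprof("writeVcf.prof", interval = 0.01)
>>     writeVcf(vcf, "out.vcf")
>>     Rprof(NULL)
>>     summaryRprof("writeVcf.prof")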
>>
>>     Michael and I both think the culprit is all the pasting and cbinding
>>     that is going on, and more to the point, that memory for an internal
>>     representation to be written out is allocated at all. Streaming across
>>     the object, looping by rows and writing directly to file (e.g. from C)
>>     should be blisteringly fast in comparison.
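>>
>>     In R pseudo-form, the idea is roughly (a sketch only; fmtRow() is a
>>     stand-in for a per-row formatter):
>>
>>     con <- file("out.vcf", "w")
>>     for (r in seq_len(nrow(vcf)))
>>         cat(fmtRow(vcf[r, ]), "\n", file = con, sep = "")
>>     close(con)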
>>
>>     ~G
>>
>>
>>     On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence
>>     <michafla at gene.com> wrote:
>>
>>         Gabe is still testing/profiling, but we'll send something
>>         randomized along eventually.
>>
>>
>>         On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan
>>         <mtmorgan at fhcrc.org> wrote:
>>
>>             I didn't see in the original thread a reproducible (simulated,
>>             I guess) example, to be explicit about what the problem is?
>>
>>             Martin
>>
>>
>>             On 08/26/2014 10:47 AM, Michael Lawrence wrote:
>>
>>                 My understanding is that the heap optimization provided
>>                 marginal gains, and that we need to think harder about how
>>                 to optimize all of the string manipulation in writeVcf. We
>>                 either need to reduce it or reduce its overhead (i.e., the
>>                 CHARSXP allocation). Gabe is doing more tests.
>>
>>
>>                 On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain
>>                 <vobencha at fhcrc.org> wrote:
>>
>>                     Hi Gabe,
>>
>>                     Martin responded, and so did Michael,
>>
>>                     https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html
>>
>>                     It sounded like Michael was ok with working
>>                     with/around heap initialization.
>>
>>                     Michael, is that right, or should we still consider
>>                     this on the table?
>>
>>
>>                     Val
>>
>>
>>                     On 08/26/2014 09:34 AM, Gabe Becker wrote:
>>
>>                         Val,
>>
>>                         Has there been any movement on this? This remains
>>                         a substantial bottleneck for us when writing very
>>                         large VCF files (e.g. variants+genotypes for whole
>>                         genome NGS samples).
>>
>>                         I was able to see a ~25% speedup with 4 cores and
>>                         an "optimal" speedup of ~2x with 10-12 cores for a
>>                         VCF with 500k rows, using a very naive
>>                         parallelization strategy and no other changes. I
>>                         suspect this could be improved on quite a bit, or
>>                         possibly made irrelevant with judicious use of
>>                         serial C code.
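>>
>>                         Roughly the shape of that naive strategy (a
>>                         sketch only; formatChunk() stands in for the
>>                         per-chunk string formatting writeVcf does):
>>
>>                         library(parallel)
>>                         idx <- splitIndices(nrow(vcf), 10L)
>>                         chunks <- mclapply(idx, function(i)
>>                             formatChunk(vcf[i, ]), mc.cores = 10L)
>>                         writeLines(unlist(chunks), "out.vcf")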
>>
>>                         Did you and Martin make any plans regarding
>>                         optimizing writeVcf?
>>
>>                         Best
>>                         ~G
>>
>>
>>                         On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain
>>                         <vobencha at fhcrc.org> wrote:
>>
>>                               Hi Michael,
>>
>>                               I'm interested in working on this. I'll
>>                               discuss with Martin next week when we're both
>>                               back in the office.
>>
>>                               Val
>>
>>                               On 08/05/14 07:46, Michael Lawrence wrote:
>>
>>                                   Hi guys (Val, Martin, Herve):
>>
>>                                   Anyone have an itch for optimization?
>>                                   The writeVcf function is currently a
>>                                   bottleneck in our WGS genotyping
>>                                   pipeline. For a typical 50 million row
>>                                   gVCF, it was taking 2.25 hours prior to
>>                                   yesterday's improvements
>>                                   (pasteCollapseRows) that brought it down
>>                                   to about 1 hour, which is still too long
>>                                   by my standards (> 0). It only takes 3
>>                                   minutes to call the genotypes (and
>>                                   associated likelihoods etc.) from the
>>                                   variant calls (using 80 cores and 450 GB
>>                                   RAM on one node), so the output is an
>>                                   issue. Profiling suggests that the
>>                                   running time scales non-linearly in the
>>                                   number of rows.
>>
>>                                   Digging a little deeper, it seems to be
>>                                   something with R's string/memory
>>                                   allocation. Below, pasting 1 million
>>                                   strings takes 6 seconds, but 10 million
>>                                   strings takes over 2 minutes. It gets
>>                                   way worse with 50 million. I suspect it
>>                                   has something to do with R's string hash
>>                                   table.
>>
>>                                   set.seed(1000)
>>                                   end <- sample(1e8, 1e6)
>>                                   system.time(paste0("END", "=", end))
>>                                        user  system elapsed
>>                                       6.396   0.028   6.420
>>
>>                                   end <- sample(1e8, 1e7)
>>                                   system.time(paste0("END", "=", end))
>>                                        user  system elapsed
>>                                   134.714   0.352 134.978
>>
>>                                   Indeed, even this takes a long time (in
>>                                   a fresh session):
>>
>>                                   set.seed(1000)
>>                                   end <- sample(1e8, 1e6)
>>                                   end <- sample(1e8, 1e7)
>>                                   system.time(as.character(end))
>>                                        user  system elapsed
>>                                      57.224   0.156  57.366
>>
>>                                   But running it a second time is faster
>>                                   (about what one would expect?):
>>
>>                                   system.time(levels <- as.character(end))
>>                                        user  system elapsed
>>                                      23.582   0.021  23.589
>>
>>                                   I did some simple profiling of R to find
>>                                   that the resizing of the string hash
>>                                   table is not a significant component of
>>                                   the time. So maybe something to do with
>>                                   the R heap/gc? No time right now to go
>>                                   deeper. But I know Martin likes this
>>                                   sort of thing ;)
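>>
>>                                   A quick way to see whether gc dominates
>>                                   (a sketch; gcinfo() reports each
>>                                   collection as it happens):
>>
>>                                   gcinfo(TRUE)
>>                                   invisible(as.character(end))
>>                                   gcinfo(FALSE)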
>>
>>                                   Michael
>>
>> --
>> Computational Biologist
>> Genentech Research
>>
>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
