[Bioc-devel] writeVcf performance

Valerie Obenchain vobencha at fhcrc.org
Tue Sep 30 22:09:34 CEST 2014


Hi Gabe,

It would help to have a common baseline. Please show the output for 
writing the Illumina file you sent originally:

library(VariantAnnotation)
fl <- "NA12877_S1.genome.vcf.gz"
vcf <- readVcf(fl, "", param=ScanVcfParam(info=NA))
dim(vcf)

gc()
print(system.time(writeVcf(vcf, tempfile())))
gc()

Here's my output. For this file, writing takes between 8-10 minutes 
depending on machine load. (Machine has 378 GIG of RAM.)

>> dim(vcf)
> [1] 51612762        1

>> gc()
>              used   (Mb) gc trigger    (Mb)   max used    (Mb)
> Ncells  157824501 8428.8  298615851 15947.9  261241334 13951.8
> Vcells 1109937571 8468.2 1778479081 13568.8 1693642432 12921.5
>> print(system.time(writeVcf(vcf, tempfile())))
>    user  system elapsed
> 601.265   9.261 614.230
>> gc()
>              used   (Mb) gc trigger    (Mb)   max used    (Mb)
> Ncells  157829989 8429.1  298615851 15947.9  261241334 13951.8
> Vcells 1109941106 8468.2 2270282109 17320.9 2243573149 17117.2

I'm not seeing substantial differences in writeVcf with expanded vs 
collapsed VCF. However, as an aside, note that calling writeVcf on 
VRanges (or equivalently ExpandedVCF) the new file will have multiple 
rows per position if there were multiple ALT alleles. I'm not sure this 
is technically valid as per the VCF specs but am assuming Michael knew 
about this.

Five positions in original VCF:
fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
vcf <- readVcf(fl, "")
 > dim(vcf)
[1] 5 3

vr <- as(vcf, "VRanges")
 > length(vr)
[1] 21

Before writing out, VRanges is coerced to ExpandedVCF; 7 rows are 
written out.
 > dim(as(vr, "VCF"))
[1] 7 3

Can you provide a portion of 'vcfgeno' for testing?

Valerie

>> sessionInfo()
> R version 3.1.0 Patched (2014-06-24 r66016)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
...

> other attached packages:
> [1] VariantAnnotation_1.11.35 Rsamtools_1.17.34
> [3] Biostrings_2.33.14        XVector_0.5.8
> [5] GenomicRanges_1.17.42     GenomeInfoDb_1.1.23
> [7] IRanges_1.99.28           S4Vectors_0.2.4
> [9] BiocGenerics_0.11.5



On 09/30/2014 08:36 AM, Gabe Becker wrote:
> Valerie,
>
> Apologies for this taking much longer than it should have. The changes
> in Bioc-devel have wreaked havoc on the code we use to to generate and
> process the data we need to write out, but the fault is mine for not
> getting on top of it sooner.
>
> I'm not seeing the speed you mentioned above in the latest devel version
> (1.11.35). It took ~1.5hrs to write the an expanded vcf with  56M rows
> (print output and sessionInfo() follow). I'll try reading in the
> illumina platinum and writing it back out to see if it is something
> about our specific vcf object (could ExpandedVCF vs VCF be an issue?).
>
>  > vcfgeno
> *class: ExpandedVCF *
> *dim: 50307989 1 *
> rowData(vcf):
>    GRanges with 4 metadata columns: REF, ALT, QUAL, FILTER
> info(vcf):
>    DataFrame with 1 column: END
>    Fields with no header: END
> geno(vcf):
>    SimpleList of length 7: AD, DP, FT, GT, GQ, PL, MIN_DP
> geno(header(vcf)):
>            Number Type    Description
>     AD     2      Integer Allelic depths (number of reads in each
> observed al...
>     DP     1      Integer Total read depth
>     FT     1      String  Variant filters
>     GT     1      String  Genotype
>     GQ     1      Integer Genotype quality
>     PL     3      Integer Normalized, Phred-scaled likelihoods for
> genotypes
>     MIN_DP 1      Integer Minimum DP observed within the GVCF block
>  > sessionInfo()
> R version 3.1.1 (2014-07-10)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats4    parallel  stats     graphics  grDevices utils     datasets
> [8] methods   base
>
> other attached packages:
>   [1] VariantCallingPaper_0.0.3 GenomicFeatures_1.17.17
>   [3] AnnotationDbi_1.27.16     Biobase_2.25.0
>   [5] gmapR_1.7.8               VTGenotyping_0.0.1
>   [7] BiocParallel_0.99.22      futile.logger_1.3.7
>   [9] VariantTools_1.7.5 *VariantAnnotation_1.11.35*
> [11] Rsamtools_1.17.34         Biostrings_2.33.14
> [13] XVector_0.5.8             rtracklayer_1.25.16
> [15] GenomicRanges_1.17.42     GenomeInfoDb_1.1.23
> [17] IRanges_1.99.28           S4Vectors_0.2.4
> [19] BiocGenerics_0.11.5       switchr_0.2.1
>
> loaded via a namespace (and not attached):
>   [1] annotate_1.43.5                base64enc_0.1-2
>   [3] BatchJobs_1.4                  BBmisc_1.7
>   [5] biomaRt_2.21.1                 bitops_1.0-6
>   [7] brew_1.0-6                     BSgenome_1.33.9
>   [9] CGPtools_2.2.0                 checkmate_1.4
> [11] codetools_0.2-9                DBI_0.3.1
> [13] DESeq_1.17.0                   digest_0.6.4
> [15] fail_1.2                       foreach_1.4.2
> [17] futile.options_1.0.0           genefilter_1.47.6
> [19] geneplotter_1.43.0             GenomicAlignments_1.1.29
> [21] genoset_1.19.32                gneDB_0.4.18
> [23] grid_3.1.1                     iterators_1.0.7
> [25] lambda.r_1.1.6                 lattice_0.20-29
> [27] Matrix_1.1-4                   RColorBrewer_1.0-5
> [29] RCurl_1.95-4.3                 rjson_0.2.14
> [31] RSQLite_0.11.4                 sendmailR_1.2-1
> [33] splines_3.1.1                  stringr_0.6.2
> [35] survival_2.37-7                tools_3.1.1
> [37] TxDb.Hsapiens.BioMart.igis_2.3 XML_3.98-1.1
> [39] xtable_1.7-4                   zlibbioc_1.11.1
>
> On Wed, Sep 17, 2014 at 2:08 PM, Valerie Obenchain <vobencha at fhcrc.org
> <mailto:vobencha at fhcrc.org>> wrote:
>
>     Hi Gabe,
>
>     Have you had a chance to test writeVcf? The changes made over the
>     past week have shaved off more time. It now takes ~ 9 minutes to
>     write the NA12877 example.
>
>             dim(vcf)
>
>         [1] 51612762        1
>
>             gc()
>
>                       used   (Mb) gc trigger    (Mb)   max used    (Mb)
>         Ncells  157818565 8428.5  298615851 15947.9  261235336 13951.5
>         Vcells 1109849222 8467.5 1778386307 13568.1 1693553890 12920.8
>
>             print(system.time(writeVcf(__vcf, tempfile())))
>
>             user  system elapsed
>         555.282   6.700 565.700
>
>             gc()
>
>                       used   (Mb) gc trigger    (Mb)   max used    (Mb)
>         Ncells  157821990 8428.7  329305975 17586.9  261482807 13964.7
>         Vcells 1176960717 8979.5 2183277445 <tel:2183277445> 16657.1
>         2171401955 16566.5
>
>
>
>     In the most recent version (1.11.35) I've added chunking for files
>     with > 1e5 records. Right now the choice of # records per chunk is
>     simple, based on total records only. We are still experimenting with
>     this. You can override default chunking with 'nchunk'. Examples on
>     the man page.
>
>     Valerie
>
>
>     On 09/08/14 08:43, Gabe Becker wrote:
>
>         Val,
>
>         That is great. I'll check this out and test it on our end.
>
>         ~G
>
>         On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain
>         <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
>         <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>> wrote:
>
>              The new writeVcf code is in 1.11.28.
>
>              Using the illumina file you suggested, geno fields only,
>         writing now
>              takes about 17 minutes.
>
>               > hdr
>              class: VCFHeader
>              samples(1): NA12877
>              meta(6): fileformat ApplyRecalibration ... reference source
>              fixed(1): FILTER
>              info(22): AC AF ... culprit set
>              geno(8): GT GQX ... PL VF
>
>               > param = ScanVcfParam(info=NA)
>               > vcf = readVcf(fl, "", param=param)
>               > dim(vcf)
>              [1] 51612762        1
>
>               > system.time(writeVcf(vcf, "out.vcf"))
>                   user   system  elapsed
>                971.032    6.568 1004.593
>
>              In 1.11.28, parsing of geno data was moved to C. If this didn't
>              speed things up enough we were planning to implement 'chunking'
>              through the VCF and/or move the parsing of info to C,
>         however, it
>              looks like geno was the bottleneck.
>
>              I've tested a number of samples/fields combinations in
>         files with >=
>              .5 million rows and the improvement over writeVcf() in
>         release is ~ 90%.
>
>              Valerie
>
>
>
>
>              On 09/04/14 15:28, Valerie Obenchain wrote:
>
>                  Thanks Gabe. I should have something for you on Monday.
>
>                  Val
>
>
>                  On 09/04/2014 01:56 PM, Gabe Becker wrote:
>
>                      Val and Martin,
>
>                      Apologies for the delay.
>
>                      We realized that the Illumina platinum genome vcf
>         files make
>                      a good test
>                      case, assuming you strip out all the info (info=NA when
>                      reading it into
>                      R) stuff.
>
>         ftp://platgene:G3n3s4me@ussd-____ftp.illumina.com/NA12877_S1.____genome.vcf.gz
>         <ftp://platgene:G3n3s4me@ussd-__ftp.illumina.com/NA12877_S1.__genome.vcf.gz>
>
>         <ftp://platgene:G3n3s4me@ussd-__ftp.illumina.com/NA12877_S1.__genome.vcf.gz
>         <ftp://platgene:G3n3s4me@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz>>
>                      took about ~4.2 hrs to write out, and is about 1.5x
>         the size
>                      of the
>                      files we are actually dealing with (~50M ranges vs
>         our ~30M).
>
>                      Looking forward a new vastly improved writeVcf :).
>
>                      ~G
>
>
>                      On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence
>                      <lawrence.michael at gene.com
>         <mailto:lawrence.michael at gene.com>
>                      <mailto:lawrence.michael at gene.__com
>         <mailto:lawrence.michael at gene.com>>
>                      <mailto:lawrence.michael at gene.
>         <mailto:lawrence.michael at gene.>____com
>                      <mailto:lawrence.michael at gene.__com
>         <mailto:lawrence.michael at gene.com>>>> wrote:
>
>                           Yes, it's very clear that the scaling is
>         non-linear,
>                      and Gabe has
>                           been experimenting with a chunk-wise +
>         parallel algorithm.
>                           Unfortunately there is some frustrating
>         overhead with the
>                           parallelism. But I'm glad Val is arriving at
>         something
>                      quicker.
>
>                           Michael
>
>
>                           On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan
>                      <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
>         <mailto:mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>>
>                           <mailto:mtmorgan at fhcrc.org
>         <mailto:mtmorgan at fhcrc.org>
>                      <mailto:mtmorgan at fhcrc.org
>         <mailto:mtmorgan at fhcrc.org>>>> wrote:
>
>                               On 08/27/2014 11:56 AM, Gabe Becker wrote:
>
>                                   The profiling I attached in my
>         previous email
>                      is for 24 geno
>                                   fields, as I said,
>                                   but our typical usecase involves only ~4-6
>                      fields, and is
>                                   faster but still on
>                                   the order of dozens of minutes.
>
>
>                               I think Val is arriving at a (much) more
>         efficient
>                               implementation, but...
>
>                               I wanted to share my guess that the poor
>         _scaling_
>                      is because
>                               the garbage collector runs multiple times
>         as the
>                      different
>                               strings are pasted together, and has to
>         traverse,
>                      in linear
>                               time, increasing numbers of allocated
>         SEXPs. So
>                      times scale
>                               approximately quadratically with the
>         number of rows
>                      in the VCF
>
>                               An efficiency is to reduce the number of
>         SEXPs in
>                      play by
>                               writing out in chunks -- as each chunk is
>         written,
>                      the SEXPs
>                               become available for collection and are
>         re-used.
>                      Here's my toy
>                               example
>
>                               time.R
>                               ======
>                               splitIndices <- function (nx, ncl)
>                               {
>                                    i <- seq_len(nx)
>                                    if (ncl == 0L)
>                                        list()
>                                    else if (ncl == 1L || nx == 1L)
>                                        list(i)
>                                    else {
>                                        fuzz <- min((nx - 1L)/1000, 0.4 *
>         nx/ncl)
>                                        breaks <- seq(1 - fuzz, nx +
>         fuzz, length
>                      = ncl + 1L)
>                                        structure(split(i, cut(i, breaks,
>                      labels=FALSE)), names
>                               = NULL)
>                                    }
>                               }
>
>                               x = as.character(seq_len(1e7)); y = sample(x)
>                               if (!is.na <http://is.na> <http://is.na>
>                      <http://is.na>(Sys.getenv("____SPLIT", NA))) {
>
>                                    idx <- splitIndices(length(x), 20)
>                                    system.time(for (i in idx)
>         paste(x[i], y[i],
>                      sep=":"))
>                               } else {
>                                    system.time(paste(x, y, sep=":"))
>                               }
>
>
>                               running under R-devel with $ SPLIT=TRUE R
>         --no-save
>                      --quiet -f
>                               time.R the relevant time is
>
>                                   user  system elapsed
>                                 15.320   0.064  15.381
>
>                               versus with $ R --no-save --quiet -f
>         time.R it is
>
>                                   user  system elapsed
>                                 95.360   0.164  95.511
>
>                               I think this is likely an overall strategy
>         when
>                      dealing with
>                               character data -- processing in
>         independent chunks
>                      of moderate
>                               (1M?) size (enabling as a consequence parallel
>                      evaluation in
>                               modest memory) that are sufficient to
>         benefit from
>                               vectorization, but that do not entail
>         allocation of
>                      large
>                               numbers of in-use SEXPs.
>
>                               Martin
>
>
>                                   Sorry for the confusion.
>                                   ~G
>
>
>                                   On Wed, Aug 27, 2014 at 11:45 AM, Gabe
>         Becker
>                                   <beckerg4 at gene.com
>         <mailto:beckerg4 at gene.com> <mailto:beckerg4 at gene.com
>         <mailto:beckerg4 at gene.com>>
>                      <mailto:beckerg4 at gene.com
>         <mailto:beckerg4 at gene.com> <mailto:beckerg4 at gene.com
>         <mailto:beckerg4 at gene.com>>>
>                                   <mailto:beckerg4 at gene.com
>         <mailto:beckerg4 at gene.com>
>                      <mailto:beckerg4 at gene.com
>         <mailto:beckerg4 at gene.com>> <mailto:beckerg4 at gene.com
>         <mailto:beckerg4 at gene.com>
>                      <mailto:beckerg4 at gene.com
>         <mailto:beckerg4 at gene.com>>>>> wrote:
>
>                                        Martin and Val.
>
>                                        I re-ran writeVcf on our (G)VCF data
>                      (34790518 ranges,
>                                   24 geno fields) with
>                                        profiling enabled. The results of
>                      summaryRprof for that
>                                   run are attached,
>                                        though for a variety of reasons
>         they are
>                      pretty
>                      misleading.
>
>                                        It took over an hour to write
>                      (3700+seconds), so it's
>                                   definitely a
>                                        bottleneck when the data get very
>         large,
>                      even if it
>                                   isn't for smaller data.
>
>                                        Michael and I both think the
>         culprit is
>                      all the pasting
>                                   and cbinding that is
>                                        going on, and more to the point, that
>                      memory for an
>                                   internal representation
>                                        to be written out is allocated at
>         all.
>                      Streaming
>                                   across the object, looping
>                                        by rows and writing directly to
>         file (e.g.
>                      from C)
>                                   should be blisteringly
>                                        fast in comparison.
>
>                                        ~G
>
>
>                                        On Tue, Aug 26, 2014 at 11:57 AM,
>         Michael
>                      Lawrence
>                                   <michafla at gene.com
>         <mailto:michafla at gene.com> <mailto:michafla at gene.com
>         <mailto:michafla at gene.com>>
>                      <mailto:michafla at gene.com
>         <mailto:michafla at gene.com> <mailto:michafla at gene.com
>         <mailto:michafla at gene.com>>>
>                                        <mailto:michafla at gene.com
>         <mailto:michafla at gene.com>
>                      <mailto:michafla at gene.com
>         <mailto:michafla at gene.com>> <mailto:michafla at gene.com
>         <mailto:michafla at gene.com>
>                      <mailto:michafla at gene.com
>         <mailto:michafla at gene.com>>>>>
>                                   wrote:
>
>                                            Gabe is still
>         testing/profiling, but
>                      we'll send
>                                   something randomized
>                                            along eventually.
>
>
>                                            On Tue, Aug 26, 2014 at 11:15 AM,
>                      Martin Morgan
>                                   <mtmorgan at fhcrc.org
>         <mailto:mtmorgan at fhcrc.org> <mailto:mtmorgan at fhcrc.org
>         <mailto:mtmorgan at fhcrc.org>>
>                      <mailto:mtmorgan at fhcrc.org
>         <mailto:mtmorgan at fhcrc.org> <mailto:mtmorgan at fhcrc.org
>         <mailto:mtmorgan at fhcrc.org>>>
>                                            <mailto:mtmorgan at fhcrc.org
>         <mailto:mtmorgan at fhcrc.org>
>                      <mailto:mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>>
>                                   <mailto:mtmorgan at fhcrc.org
>         <mailto:mtmorgan at fhcrc.org>
>                      <mailto:mtmorgan at fhcrc.org
>         <mailto:mtmorgan at fhcrc.org>>>>> wrote:
>
>                                                I didn't see in the
>         original thread a
>                                   reproducible (simulated, I
>                                                guess) example, to be
>         explicit
>                      about what the
>                                   problem is??
>
>                                                Martin
>
>
>                                                On 08/26/2014 10:47 AM,
>         Michael
>                      Lawrence wrote:
>
>                                                    My understanding is
>         that the heap
>                                   optimization provided marginal
>                                                    gains, and
>                                                    that we need to think
>         harder
>                      about how to
>                                   optimize the all of
>                                                    the string
>                                                    manipulation in
>         writeVcf. We
>                      either need to
>                                   reduce it or reduce its
>                                                    overhead (i.e., the
>         CHARSXP
>                      allocation).
>                                   Gabe is doing more tests.
>
>
>                                                    On Tue, Aug 26, 2014
>         at 9:43
>                      AM, Valerie
>                                   Obenchain
>                                                    <vobencha at fhcrc.org
>         <mailto:vobencha at fhcrc.org>
>                      <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>
>                                   <mailto:vobencha at fhcrc.org
>         <mailto:vobencha at fhcrc.org>
>                      <mailto:vobencha at fhcrc.org
>         <mailto:vobencha at fhcrc.org>>> <mailto:vobencha at fhcrc.org
>         <mailto:vobencha at fhcrc.org>
>                      <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>
>                                   <mailto:vobencha at fhcrc.org
>         <mailto:vobencha at fhcrc.org>
>                      <mailto:vobencha at fhcrc.org
>         <mailto:vobencha at fhcrc.org>>>>>
>
>                                                    wrote:
>
>                                                        Hi Gabe,
>
>                                                        Martin responded,
>         and so
>                      did Michael,
>
>
>         https://stat.ethz.ch/________pipermail/bioc-devel/2014-________August/006082.html
>         <https://stat.ethz.ch/______pipermail/bioc-devel/2014-______August/006082.html>
>
>         <https://stat.ethz.ch/______pipermail/bioc-devel/2014-______August/006082.html
>         <https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html>>
>
>
>
>         <https://stat.ethz.ch/______pipermail/bioc-devel/2014-______August/006082.html
>         <https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html>
>
>         <https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html
>         <https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html>>>
>
>
>
>
>         <https://stat.ethz.ch/______pipermail/bioc-devel/2014-______August/006082.html
>         <https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html>
>
>         <https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html
>         <https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html>>
>
>
>         <https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html
>         <https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html>
>
>         <https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html
>         <https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html>>>>
>
>                                                        It sounded like
>         Michael
>                      was ok with
>                                   working with/around heap
>                                                        initialization.
>
>                                                        Michael, is that
>         right or
>                      should we
>                                   still consider this on
>                                                        the table?
>
>
>                                                        Val
>
>
>                                                        On 08/26/2014
>         09:34 AM,
>                      Gabe Becker
>                      wrote:
>
>                                                            Val,
>
>                                                            Has there
>         been any
>                      movement on
>                                   this? This remains a
>                                                            substantial
>                                                            bottleneck
>         for us when
>                      writing very
>                                   large VCF files (e.g.
>
>         variants+genotypes for
>                      whole genome
>                                   NGS samples).
>
>                                                            I was able to
>         see a
>                      ~25% speedup
>                                   with 4 cores and  an
>                                                            "optimal" speedup
>                                                            of ~2x with 10-12
>                      cores for a VCF
>                                   with 500k rows  using
>                                                            a very naive
>                                                            parallelization
>                      strategy and no
>                                   other changes. I suspect
>                                                            this could be
>                                                            improved on
>         quite a
>                      bit, or
>                                   possibly made irrelevant
>                                                            with
>         judicious use
>                                                            of serial C code.
>
>                                                            Did you and
>         Martin
>                      make any plans
>                                   regarding optimizing
>                                                            writeVcf?
>
>                                                            Best
>                                                            ~G
>
>
>                                                            On Tue, Aug
>         5, 2014 at
>                      2:33 PM,
>                                   Valerie Obenchain
>
>         <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
>                      <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>
>                                   <mailto:vobencha at fhcrc.org
>         <mailto:vobencha at fhcrc.org>
>                      <mailto:vobencha at fhcrc.org
>         <mailto:vobencha at fhcrc.org>>> <mailto:vobencha at fhcrc.org
>         <mailto:vobencha at fhcrc.org>
>                      <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>
>                                   <mailto:vobencha at fhcrc.org
>         <mailto:vobencha at fhcrc.org>
>                      <mailto:vobencha at fhcrc.org
>         <mailto:vobencha at fhcrc.org>>>>
>
>                        <mailto:vobencha at fhcrc.org
>         <mailto:vobencha at fhcrc.org> <mailto:vobencha at fhcrc.org
>         <mailto:vobencha at fhcrc.org>>
>                                   <mailto:vobencha at fhcrc.org
>         <mailto:vobencha at fhcrc.org>
>                      <mailto:vobencha at fhcrc.org
>         <mailto:vobencha at fhcrc.org>>> <mailto:vobencha at fhcrc.org
>         <mailto:vobencha at fhcrc.org>
>                      <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>
>                                   <mailto:vobencha at fhcrc.org
>         <mailto:vobencha at fhcrc.org>
>                      <mailto:vobencha at fhcrc.org
>         <mailto:vobencha at fhcrc.org>>>>>__>
>
>                                                            wrote:
>
>                                                                  Hi Michael,
>
>                                                                  I'm
>         interested
>                      in working on
>                                   this. I'll discuss
>                                                            with Martin next
>                                                                  week
>         when we're
>                      both back in
>                                   the office.
>
>                                                                  Val
>
>
>
>
>
>                                                                  On 08/05/14
>                      07:46, Michael
>                                   Lawrence wrote:
>
>                                                                      Hi guys
>                      (Val, Martin,
>                      Herve):
>
>
>         Anyone have
>                      an itch for
>                                   optimization? The
>                                                            writeVcf
>         function is
>
>         currently a
>
>         bottleneck
>                      in our WGS
>                                   genotyping pipeline. For
>                                                            a typical 50
>
>         million row
>
>         gVCF, it was
>                      taking 2.25
>                                   hours prior to
>                                                            yesterday's
>         improvements
>
>                        (pasteCollapseRows) that
>                                   brought it down to
>                                                            about 1 hour,
>         which
>                                                                      is
>         still
>                                                                      too
>         long by
>                      my standards
>                                   (> 0). Only takes 3
>                                                            minutes to
>         call the
>
>         genotypes
>
>         (and associated
>                                   likelihoods etc) from the
>                                                            variant calls
>         (using
>                                                                      80
>         cores and
>                                                                      450
>         GB RAM
>                      on one node),
>                                   so the output is an
>                                                            issue. Profiling
>
>         suggests that
>                                                                      the
>         running
>                      time scales
>                                   non-linearly in the
>                                                            number of rows.
>
>
>         Digging a
>                      little deeper,
>                                   it seems to be
>                                                            something
>         with R's
>
>         string/memory
>
>         allocation.
>                      Below,
>                                   pasting 1 million strings
>                                                            takes 6
>         seconds, but
>                                                            10
>                                                                      million
>                      strings takes
>                                   over 2 minutes. It gets
>                                                            way worse with 50
>
>         million. I
>
>         suspect it
>                      has something
>                                   to do with R's string
>                                                            hash table.
>
>
>         set.seed(1000)
>                                                                      end <-
>                      sample(1e8, 1e6)
>
>                        system.time(paste0("END",
>                                   "=", end))
>
>           user
>                      system elapsed
>
>         6.396
>                        0.028   6.420
>
>                                                                      end <-
>                      sample(1e8, 1e7)
>
>                        system.time(paste0("END",
>                                   "=", end))
>
>           user
>                      system elapsed
>                                                                      134.714
>                        0.352 134.978
>
>
>         Indeed, even
>                      this takes a
>                                   long time (in a
>                                                            fresh session):
>
>
>         set.seed(1000)
>                                                                      end <-
>                      sample(1e8, 1e6)
>                                                                      end <-
>                      sample(1e8, 1e7)
>
>                                   system.time(as.character(end))
>
>           user
>                      system elapsed
>
>           57.224
>                        0.156  57.366
>
>                                                                      But
>         running
>                      it a second
>                                   time is faster (about
>                                                            what one would
>
>         expect?):
>
>
>                        system.time(levels <-
>                                   as.character(end))
>
>           user
>                      system elapsed
>
>           23.582
>                        0.021  23.589
>
>                                                                      I
>         did some
>                      simple
>                                   profiling of R to find that
>                                                            the resizing of
>                                                                      the
>         string
>
>         hash table
>                      is not a
>                                   significant component of
>                                                            the time. So
>         maybe
>
>         something
>                                                                      to
>         do with
>                      the R heap/gc?
>                                   No time right now to
>                                                            go deeper. But I
>
>         know Martin
>
>         likes this
>                      sort of
>                      thing ;)
>
>                                                                      Michael
>
>
>                      [[alternative
>                                   HTML version deleted]]
>
>
>
>
>
>         _________________________________________________________
>         Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>         <mailto:Bioc-devel at r-project.__org
>         <mailto:Bioc-devel at r-project.org>>
>                      <mailto:Bioc-devel at r-project.
>         <mailto:Bioc-devel at r-project.>____org
>                      <mailto:Bioc-devel at r-project.__org
>         <mailto:Bioc-devel at r-project.org>>>
>
>                        <mailto:Bioc-devel at r-project
>         <mailto:Bioc-devel at r-project>.
>                      <mailto:Bioc-devel at r-project
>         <mailto:Bioc-devel at r-project>.>______org
>                                   <mailto:Bioc-devel at r-project.
>         <mailto:Bioc-devel at r-project.>____org
>                      <mailto:Bioc-devel at r-project.__org
>         <mailto:Bioc-devel at r-project.org>>>>
>
>                        <mailto:Bioc-devel at r-project
>         <mailto:Bioc-devel at r-project> <mailto:Bioc-devel at r-project
>         <mailto:Bioc-devel at r-project>>.
>                                   <mailto:Bioc-devel at r-project
>         <mailto:Bioc-devel at r-project>
>                      <mailto:Bioc-devel at r-project
>         <mailto:Bioc-devel at r-project>>.__>______org
>
>                        <mailto:Bioc-devel at r-project
>         <mailto:Bioc-devel at r-project>.
>                      <mailto:Bioc-devel at r-project
>         <mailto:Bioc-devel at r-project>.>______org
>                                   <mailto:Bioc-devel at r-project.
>         <mailto:Bioc-devel at r-project.>____org
>                      <mailto:Bioc-devel at r-project.__org
>         <mailto:Bioc-devel at r-project.org>>>>>
>
>         mailing list
>
>         https://stat.ethz.ch/mailman/__________listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/________listinfo/bioc-devel>
>
>         <https://stat.ethz.ch/mailman/________listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>>
>
>
>         <https://stat.ethz.ch/mailman/________listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>
>
>         <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>>
>
>
>
>         <https://stat.ethz.ch/mailman/________listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>
>
>         <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>
>
>
>         <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>
>
>         <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>>>
>
>
>
>
>         <https://stat.ethz.ch/mailman/________listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>
>
>         <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>
>
>
>         <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>
>
>         <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>>
>
>
>
>         <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>
>
>         <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>
>
>
>         <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
>                      <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>>>
>
>
>
>
>
>         _________________________________________________________
>         Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>         <mailto:Bioc-devel at r-project.__org
>         <mailto:Bioc-devel at r-project.org>>
>                      <mailto:Bioc-devel at r-project.
>         <mailto:Bioc-devel at r-project.>____org
>                      <mailto:Bioc-devel at r-project.__org
>         <mailto:Bioc-devel at r-project.org>>>
>
>                        <mailto:Bioc-devel at r-project
>         <mailto:Bioc-devel at r-project>.
>                      <mailto:Bioc-devel at r-project
>         <mailto:Bioc-devel at r-project>.>______org
>                                   <mailto:Bioc-devel at r-project.
>         <mailto:Bioc-devel at r-project.>____org
>                      <mailto:Bioc-devel at r-project.__org
>         <mailto:Bioc-devel at r-project.org>>>>
>
>                        <mailto:Bioc-devel at r-project
>         <mailto:Bioc-devel at r-project> <mailto:Bioc-devel at r-project
>         <mailto:Bioc-devel at r-project>>.
>                                   <mailto:Bioc-devel at r-project
>         <mailto:Bioc-devel at r-project>
>                      <mailto:Bioc-devel at r-project
>         <mailto:Bioc-devel at r-project>>.__>______org
>
>                        <mailto:Bioc-devel at r-project
>         <mailto:Bioc-devel at r-project>.
>                      <mailto:Bioc-devel at r-project
>         <mailto:Bioc-devel at r-project>.>______org
>                                   <mailto:Bioc-devel at r-project.
>         <mailto:Bioc-devel at r-project.>____org
>                      <mailto:Bioc-devel at r-project.__org
>         <mailto:Bioc-devel at r-project.org>>>>> mailing
>                                                            list
>         https://stat.ethz.ch/mailman/__________listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/________listinfo/bioc-devel>
>
>         <https://stat.ethz.ch/mailman/________listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>>
>
>
>         <https://stat.ethz.ch/mailman/________listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>
>
>         <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>>
>
>
>
>         <https://stat.ethz.ch/mailman/________listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>
>
>         <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>
>
>
>         <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>
>
>         <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>>>
>
>
>
>
>
>
>         <https://stat.ethz.ch/mailman/________listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>
>
>         <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>
>
>
>         <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>
>
>         <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>>
>
>
>
>         <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>
>
>         <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>
>
>
>         <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
>                      <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>>>
>
>
>
>
>                                                            --
>                                                            Computational
>         Biologist
>                                                            Genentech
>         Research
>
>
>
>
>
>
>           [[alternative HTML
>                      version
>                      deleted]]
>
>
>
>                      _______________________________________________________
>         Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>         <mailto:Bioc-devel at r-project.__org
>         <mailto:Bioc-devel at r-project.org>>
>                      <mailto:Bioc-devel at r-project.
>         <mailto:Bioc-devel at r-project.>____org
>                      <mailto:Bioc-devel at r-project.__org
>         <mailto:Bioc-devel at r-project.org>>>
>                                   <mailto:Bioc-devel at r-project
>         <mailto:Bioc-devel at r-project>.
>                      <mailto:Bioc-devel at r-project
>         <mailto:Bioc-devel at r-project>.>______org
>                                   <mailto:Bioc-devel at r-project.
>         <mailto:Bioc-devel at r-project.>____org
>                      <mailto:Bioc-devel at r-project.__org
>         <mailto:Bioc-devel at r-project.org>>>>
>                                                    mailing list
>         https://stat.ethz.ch/mailman/________listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>
>
>         <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>
>
>
>         <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>
>
>         <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>>
>
>
>
>         <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>
>
>         <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>
>
>
>         <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
>                      <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>>
>
>
>
>                                                --
>                                                Computational Biology / Fred
>                      Hutchinson Cancer
>                                   Research Center
>                                                1100 Fairview Ave. N.
>                                                PO Box 19024 Seattle, WA
>         98109
>
>                                                Location: Arnold Building
>         M1 B861
>                                                Phone: (206) 667-2793
>         <tel:%28206%29%20667-2793>
>                      <tel:%28206%29%20667-2793>
>                                   <tel:%28206%29%20667-2793>
>                      <tel:%28206%29%20667-2793>
>
>
>
>
>
>
>                                        --
>                                        Computational Biologist
>                                        Genentech Research
>
>
>
>
>                                   --
>                                   Computational Biologist
>                                   Genentech Research
>
>
>
>                               --
>                               Computational Biology / Fred Hutchinson Cancer
>                      Research Center
>                               1100 Fairview Ave. N.
>                               PO Box 19024 Seattle, WA 98109
>
>                               Location: Arnold Building M1 B861
>                               Phone: (206) 667-2793
>         <tel:%28206%29%20667-2793> <tel:%28206%29%20667-2793>
>                      <tel:%28206%29%20667-2793>
>
>
>
>
>
>                      --
>                      Computational Biologist
>                      Genentech Research
>
>
>                  ___________________________________________________
>         Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>         <mailto:Bioc-devel at r-project.__org
>         <mailto:Bioc-devel at r-project.org>>
>                  mailing list
>         https://stat.ethz.ch/mailman/____listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
>                  <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>
>
>
>
>
>         --
>         Computational Biologist
>         Genentech Research
>
>
>
>
>
> --
> Computational Biologist
> Genentech Research



More information about the Bioc-devel mailing list