[BioC] Why is writeVcf() slow?

Sat Oct 27 00:14:45 CEST 2012

Hi Pete,

Thanks for sending the data. I confirmed your numbers for writing out 
the sample file.

 > load('~/Downloads/slow_writeVcf_example_data.RData')
 > vcf <- vcf.4aff.constrained
 > vcf
class: VCF
dim: 2668 7
genome: mm9
...
 > system.time(writeVcf(vcf, "mm9.vcf"))
    user  system elapsed
164.362   0.548 165.378
Warning message:
In .Method(..., deparse.level = deparse.level) :
   number of rows of result is not a multiple of vector length (arg 1)

Sample data from the package with ~10K variants was also slow.

 > fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
 > vcf <- readVcf(fl, "hg19")
 > vcf
class: VCF
dim: 10376 5
genome: hg19
...
 > system.time(writeVcf(vcf, "hg19.vcf"))
     user   system  elapsed
1113.089    2.052 1117.831
Warning message:
In .Method(..., deparse.level = deparse.level) :
   number of rows of result is not a multiple of vector length (arg 1)

The warning came from a problem when creating the FORMAT lines for the 
geno fields when we both list and non-list matrices are present. The 
speed issues I identified were parsing bottlenecks when preparing the 
info and geno data for writing.

With the modifications these two examples are showing a >95% speed increase.

## Note that the VCF class structure has changed in VariantAnnotation ## 
in devel. VCF is now a VIRTUAL class with subclasses of CollapsedVCF ## 
and ExpandedVCF. Old VCF instances must be updated to CollapsedVCF ## 
objects. See devel ?VCF for details.
 > load("~/Downloads/slow_writeVcf_example_data.RData")
 > vcf <- updateObject(vcf.4aff.constrained)
 > vcf
class: CollapsedVCF
dim: 2668 7
genome: mm9
...
 > system.time(writeVcf(vcf, "mm9.vcf"))
    user  system elapsed
   2.276   0.060   2.339

Again the sample file from the package,

 > fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
 > vcf <- readVcf(fl, "hg19")
 > vcf
class: CollapsedVCF
dim: 10376 5
genome: hg19
...
 > system.time(writeVcf(vcf, "hg19.vcf"))
    user  system elapsed
   4.717   0.012   4.738

Fixes are in VariantAnnotation 1.5.7 in devel and 1.4.2 in release. 
Thanks for reporting this.

Valerie

On 10/23/2012 09:08 PM, hickey at wehi.EDU.AU wrote:
> Valerie - I'll send you a subset of the VCF off list.
>
> Tim - This isn't something I'd considered. I'd always thought of VCF as
> just another delimited format but that post highlights the "intricacies"
> of VCF that I hadn't considered. In saying that, the files I'm dealing
> with here are pretty small (300K variants on 7 samples, 21Mb bgzipped),
> especially compared to the 1KG VCFs which are multi-Gb in size and
> contain thousands of samples.
>
> Pete
>
>
> On 24/10/2012, at 1:52 PM, Tim Triche, Jr. wrote:
>
>> is part of this an issue with the VCF format itself?
>>
>> http://campagnelab.org/evaluating-goby-against-the-1000-genome-genotype-calls-and-why-is-vcf-so-inefficient/
>>
>> or is that a separate concern?
>>
>> thanks,
>>
>> --t
>>
>>
>> On Tue, Oct 23, 2012 at 7:30 PM, Valerie Obenchain <vobencha at fhcrc.org
>> <mailto:vobencha at fhcrc.org>> wrote:
>>
>>     Hi Peter,
>>
>>     I'd say yes, the warning is a cause for concern. Is the vcf you're
>>     working with publicly available?  If not, can you send a smaller
>>     subset that still produces the error?
>>
>>     Valerie
>>
>>
>>
>>     On 10/23/12 17:29, hickey at wehi.EDU.AU <mailto:hickey at wehi.EDU.AU>
>>     wrote:
>>
>>         I am using the writeVcf() function in the VariantAnnotation
>>         package, which I understand to be 'under construction' from
>>         the documentation. I am wondering why writeVcf() is taking so
>>         long to do it's job and if there is anything I can do to speed
>>         this up? For example, I have a VCF object containing some
>>         36,000 variants that I want to write to file and this takes
>>         nearly 3 hours. It also produces a warning that I am unsure
>>         how to interpret, i.e. should I be worried by it? The VCF
>>         written to disk appears to be valid.
>>
>>         Many thanks for your help.
>>
>>         ## Code
>>
>>             library(VariantAnnotation)
>>             vcf<- readVcf(file =
>>             'UnifiedGenotyper.autosomal.__snps.indels.vcf.gz', genome
>>             = 'mm9')
>>             vcf
>>
>>         class: VCF
>>         dim: 292155 7
>>         genome: mm9
>>         exptData(1): header
>>         fixed(4): REF ALT QUAL FILTER
>>         info(21): AC AF ... SB STR
>>         geno(5): AD DP GQ GT PL
>>         rownames(292155): chr1:3003110 chr1:3012398 ... chr19:61340696
>>            chr19:61340708
>>         rowData values names(1): paramRangeID
>>         colnames(7): 506/Bcl2#292 506/Bcl2#306 ... 506/Bcl2#385
>>         506/Bcl2#409
>>         colData names(1): Samples
>>
>>         ## Some subsetting of the 'vcf' object to create
>>         vcf.2aff.constrained
>>
>>             vcf.2aff.constrained
>>
>>         class: VCF
>>         dim: 35514 7
>>         genome: mm9
>>         exptData(1): header
>>         fixed(4): REF ALT QUAL FILTER
>>         info(21): AC AF ... SB STR
>>         geno(5): AD DP GQ GT PL
>>         rownames(35514): chr1:3087036 chr1:3183065 ... chr19:61340449
>>            chr19:61340453
>>         rowData values names(1): paramRangeID
>>         colnames(7): 506/Bcl2#292 506/Bcl2#306 ... 506/Bcl2#385
>>         506/Bcl2#409
>>         colData names(1): Samples
>>
>>             system.time(writeVcf(obj = vcf.2aff.constrained, filename
>>             =
>>             'UnifiedGenotyper.autosomal.__snps.indels.constrained.__genotypes.2aff.vcf'))
>>
>>               user    system   elapsed
>>         10653.768     0.244 10659.163
>>         Warning message:
>>         In .Method(..., deparse.level = deparse.level) :
>>            number of rows of result is not a multiple of vector length
>>         (arg 1)
>>
>>             sessionInfo()
>>
>>         R version 2.15.1 (2012-06-22)
>>         Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>>         locale:
>>           [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>           [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>           [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>           [7] LC_PAPER=C                 LC_NAME=C
>>           [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>         [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>>         attached base packages:
>>         [1] stats     graphics  grDevices utils     datasets  methods
>>           base
>>
>>         other attached packages:
>>         [1] VariantAnnotation_1.4.1 Rsamtools_1.10.1
>>          Biostrings_2.26.2
>>         [4] GenomicRanges_1.10.2    IRanges_1.16.2
>>          BiocGenerics_0.4.0
>>
>>         loaded via a namespace (and not attached):
>>           [1] AnnotationDbi_1.20.1   Biobase_2.18.0         biomaRt_2.14.0
>>           [4] bitops_1.0-4.1         BSgenome_1.26.1        DBI_0.2-5
>>           [7] GenomicFeatures_1.10.0 parallel_2.15.1        RCurl_1.95-1.1
>>         [10] RSQLite_0.11.2         rtracklayer_1.18.0     stats4_2.15.1
>>         [13] tools_2.15.1           XML_3.95-0.1           zlibbioc_1.4.0
>>
>>         ------------------------------__--
>>         Peter Hickey,
>>         PhD Student/Research Assistant,
>>         Bioinformatics Division,
>>         Walter and Eliza Hall Institute of Medical Research,
>>         1G Royal Parade, Parkville, Vic 3052, Australia.
>>         Ph: +613 9345 2324 <tel:%2B613%209345%202324>
>>
>>         hickey at wehi.edu.au <mailto:hickey at wehi.edu.au>
>>         http://www.wehi.edu.au <http://www.wehi.edu.au/>
>>
>>
>>         __________________________________________________________________________
>>         The information in this email is confidential and
>>         intend...{{dropped:8}}
>>
>>         _________________________________________________
>>         Bioconductor mailing list
>>         Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>>         https://stat.ethz.ch/mailman/__listinfo/bioconductor
>>         <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>         Search the archives:
>>         http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>>         <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>
>>
>>     _________________________________________________
>>     Bioconductor mailing list
>>     Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>>     https://stat.ethz.ch/mailman/__listinfo/bioconductor
>>     <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>     Search the archives:
>>     http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>>     <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>
>>
>>
>>
>> --
>> /A model is a lie that helps you see the truth./
>> /
>> /
>> Howard Skipper
>> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>>
>
> --------------------------------
> Peter Hickey,
> PhD Student/Research Assistant,
> Bioinformatics Division,
> Walter and Eliza Hall Institute of Medical Research,
> 1G Royal Parade, Parkville, Vic 3052, Australia.
> Ph: +613 9345 2324
>
> hickey at wehi.edu.au <mailto:hickey at wehi.edu.au>
> http://www.wehi.edu.au
>
>
> ______________________________________________________________________
> The information in this email is confidential and inte...{{dropped:6}}