[Bioc-devel] VariantAnnotation: filterVcf with chunks duplicates header

Fri Aug 7 20:05:16 CEST 2015

Thanks for the report and reproducible example. Now fixed in 1.14.8 and 
1.15.22.

Valerie

On 08/07/2015 07:25 AM, Julian Gehring wrote:
> Hi,
>
> When I use the 'filterVcf' function to process a VCF file in chunks, the
> output file contains as many copies of the header as there are
> chunks. Consider the example, adapted from the 'filterVcf' man page:
>
>    library(VariantAnnotation)
>
>    fl <- system.file(package = "VariantAnnotation", "extdata", "chr22.vcf.gz")
>    filt <- FilterRules(list(isSNP = function(x) info(x)$VT == "SNP"))
>
>    ## Control: Processing without chunking
>    t0 <- TabixFile(fl) ## yieldSize -> NA
>    out0 <- filterVcf(t0, "hg19", tempfile(), filters = filt)
>
>    tx0 = readLines(out0)
>    header_lines0 = grep("^##", tx0)
>    ## 30 header lines, in line 1,..,30
>
>    ## Case: Processing in 3 chunks
>    t1 <- TabixFile(fl, yieldSize = 5e3) ## gives us 3 chunks here
>    out1 <- filterVcf(t1, "hg19", tempfile(), filters = filt)
>
>    tx1 = readLines(out1)
>    header_lines1 = grep("^##", tx1)
>    ## 90 header lines, header 3 times duplicated
>    ## 1) 1,..,30, 2) 4827,..,4856, 3) 9673,..,9702
>
> It seems that for each chunk, a complete VCF file including the header
> gets written.  See the relevant part of VariantFiltering:::.filter:
>
>     while (nrow(vcfChunk <- readVcf(tbxFile, genome, ..., param = param))) {
>       vcfChunk <- subsetByFilter(vcfChunk, filters)
>       writeVcf(vcfChunk, filtered)
>     }
>
> For the processing of VCFs with large headers in many chunks
> (e.g. 1000genomes callsets), this can result in the paradox situation
> that the filtered file ends up being significantly larger than the
> original.
>
> Best wishes
> Julian
>
>
>
> R version 3.2.1 (2015-06-18)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 14.04.3 LTS
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats4    parallel  stats     graphics  grDevices utils     datasets
> [8] methods   base
>
> other attached packages:
> [1] VariantAnnotation_1.14.6 Rsamtools_1.20.4         Biostrings_2.36.1
> [4] XVector_0.8.0            GenomicRanges_1.20.5     GenomeInfoDb_1.4.1
> [7] IRanges_2.2.5            S4Vectors_0.6.3          BiocGenerics_0.14.0
>
> loaded via a namespace (and not attached):
>   [1] AnnotationDbi_1.30.1    zlibbioc_1.14.0         GenomicAlignments_1.4.1
>   [4] BiocParallel_1.2.9      BSgenome_1.36.3         tools_3.2.1
>   [7] Biobase_2.28.0          DBI_0.3.1               lambda.r_1.1.7
> [10] futile.logger_1.4.1     rtracklayer_1.28.6      futile.options_1.0.0
> [13] bitops_1.0-6            RCurl_1.95-4.7          biomaRt_2.24.0
> [16] RSQLite_1.0.0           GenomicFeatures_1.20.1  XML_3.98-1.3
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109

Email: vobencha at fredhutch.org
Phone: (206) 667-3158