[Bioc-devel] VariantAnnotation: filterVcf with chunks duplicates header
Valerie Obenchain
vobencha at fredhutch.org
Fri Aug 7 20:05:16 CEST 2015
Thanks for the report and reproducible example. Now fixed in 1.14.8 and
1.15.22.
Valerie
On 08/07/2015 07:25 AM, Julian Gehring wrote:
> Hi,
>
> When I use the 'filterVcf' function to process a VCF file in chunks, the
> output file contains as many copies of the header as there are
> chunks. Consider the example, adapted from the 'filterVcf' man page:
>
> library(VariantAnnotation)
>
> fl <- system.file(package = "VariantAnnotation", "extdata", "chr22.vcf.gz")
> filt <- FilterRules(list(isSNP = function(x) info(x)$VT == "SNP"))
>
> ## Control: Processing without chunking
> t0 <- TabixFile(fl) ## yieldSize -> NA
> out0 <- filterVcf(t0, "hg19", tempfile(), filters = filt)
>
> tx0 = readLines(out0)
> header_lines0 = grep("^##", tx0)
> ## 30 header lines, in line 1,..,30
>
> ## Case: Processing in 3 chunks
> t1 <- TabixFile(fl, yieldSize = 5e3) ## gives us 3 chunks here
> out1 <- filterVcf(t1, "hg19", tempfile(), filters = filt)
>
> tx1 = readLines(out1)
> header_lines1 = grep("^##", tx1)
> ## 90 header lines, header 3 times duplicated
> ## 1) 1,..,30, 2) 4827,..,4856, 3) 9673,..,9702
>
> It seems that for each chunk, a complete VCF file including the header
> gets written. See the relevant part of VariantFiltering:::.filter:
>
> while (nrow(vcfChunk <- readVcf(tbxFile, genome, ..., param = param))) {
> vcfChunk <- subsetByFilter(vcfChunk, filters)
> writeVcf(vcfChunk, filtered)
> }
>
> For the processing of VCFs with large headers in many chunks
> (e.g. 1000genomes callsets), this can result in the paradox situation
> that the filtered file ends up being significantly larger than the
> original.
>
> Best wishes
> Julian
>
>
>
> R version 3.2.1 (2015-06-18)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 14.04.3 LTS
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats4 parallel stats graphics grDevices utils datasets
> [8] methods base
>
> other attached packages:
> [1] VariantAnnotation_1.14.6 Rsamtools_1.20.4 Biostrings_2.36.1
> [4] XVector_0.8.0 GenomicRanges_1.20.5 GenomeInfoDb_1.4.1
> [7] IRanges_2.2.5 S4Vectors_0.6.3 BiocGenerics_0.14.0
>
> loaded via a namespace (and not attached):
> [1] AnnotationDbi_1.30.1 zlibbioc_1.14.0 GenomicAlignments_1.4.1
> [4] BiocParallel_1.2.9 BSgenome_1.36.3 tools_3.2.1
> [7] Biobase_2.28.0 DBI_0.3.1 lambda.r_1.1.7
> [10] futile.logger_1.4.1 rtracklayer_1.28.6 futile.options_1.0.0
> [13] bitops_1.0-6 RCurl_1.95-4.7 biomaRt_2.24.0
> [16] RSQLite_1.0.0 GenomicFeatures_1.20.1 XML_3.98-1.3
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109
Email: vobencha at fredhutch.org
Phone: (206) 667-3158
More information about the Bioc-devel
mailing list