[Bioc-devel] VariantAnnotation: filterVcf with chunks duplicates header
Julian Gehring
julian.gehring at embl.de
Fri Aug 7 16:25:00 CEST 2015
Hi,
When I use the 'filterVcf' function to process a VCF file in chunks, the
output file contains as many copies of the header as there are
chunks. Consider the example, adapted from the 'filterVcf' man page:
library(VariantAnnotation)
fl <- system.file(package = "VariantAnnotation", "extdata", "chr22.vcf.gz")
filt <- FilterRules(list(isSNP = function(x) info(x)$VT == "SNP"))
## Control: Processing without chunking
t0 <- TabixFile(fl) ## yieldSize -> NA
out0 <- filterVcf(t0, "hg19", tempfile(), filters = filt)
tx0 = readLines(out0)
header_lines0 = grep("^##", tx0)
## 30 header lines, in line 1,..,30
## Case: Processing in 3 chunks
t1 <- TabixFile(fl, yieldSize = 5e3) ## gives us 3 chunks here
out1 <- filterVcf(t1, "hg19", tempfile(), filters = filt)
tx1 = readLines(out1)
header_lines1 = grep("^##", tx1)
## 90 header lines, header 3 times duplicated
## 1) 1,..,30, 2) 4827,..,4856, 3) 9673,..,9702
It seems that for each chunk, a complete VCF file including the header
gets written. See the relevant part of VariantFiltering:::.filter:
while (nrow(vcfChunk <- readVcf(tbxFile, genome, ..., param = param))) {
vcfChunk <- subsetByFilter(vcfChunk, filters)
writeVcf(vcfChunk, filtered)
}
For the processing of VCFs with large headers in many chunks
(e.g. 1000genomes callsets), this can result in the paradox situation
that the filtered file ends up being significantly larger than the
original.
Best wishes
Julian
R version 3.2.1 (2015-06-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=de_DE.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] VariantAnnotation_1.14.6 Rsamtools_1.20.4 Biostrings_2.36.1
[4] XVector_0.8.0 GenomicRanges_1.20.5 GenomeInfoDb_1.4.1
[7] IRanges_2.2.5 S4Vectors_0.6.3 BiocGenerics_0.14.0
loaded via a namespace (and not attached):
[1] AnnotationDbi_1.30.1 zlibbioc_1.14.0 GenomicAlignments_1.4.1
[4] BiocParallel_1.2.9 BSgenome_1.36.3 tools_3.2.1
[7] Biobase_2.28.0 DBI_0.3.1 lambda.r_1.1.7
[10] futile.logger_1.4.1 rtracklayer_1.28.6 futile.options_1.0.0
[13] bitops_1.0-6 RCurl_1.95-4.7 biomaRt_2.24.0
[16] RSQLite_1.0.0 GenomicFeatures_1.20.1 XML_3.98-1.3
More information about the Bioc-devel
mailing list