[Bioc-devel] VariantAnnotation: filterVcf with chunks duplicates header

Fri Aug 7 16:25:00 CEST 2015

Hi,

When I use the 'filterVcf' function to process a VCF file in chunks, the
output file contains as many copies of the header as there are
chunks. Consider the example, adapted from the 'filterVcf' man page:

  library(VariantAnnotation)

  fl <- system.file(package = "VariantAnnotation", "extdata", "chr22.vcf.gz")
  filt <- FilterRules(list(isSNP = function(x) info(x)$VT == "SNP"))

  ## Control: Processing without chunking
  t0 <- TabixFile(fl) ## yieldSize -> NA
  out0 <- filterVcf(t0, "hg19", tempfile(), filters = filt)

  tx0 = readLines(out0)
  header_lines0 = grep("^##", tx0)
  ## 30 header lines, in line 1,..,30

  ## Case: Processing in 3 chunks
  t1 <- TabixFile(fl, yieldSize = 5e3) ## gives us 3 chunks here
  out1 <- filterVcf(t1, "hg19", tempfile(), filters = filt)

  tx1 = readLines(out1)
  header_lines1 = grep("^##", tx1)
  ## 90 header lines, header 3 times duplicated
  ## 1) 1,..,30, 2) 4827,..,4856, 3) 9673,..,9702

It seems that for each chunk, a complete VCF file including the header
gets written.  See the relevant part of VariantFiltering:::.filter:

   while (nrow(vcfChunk <- readVcf(tbxFile, genome, ..., param = param))) {
     vcfChunk <- subsetByFilter(vcfChunk, filters)
     writeVcf(vcfChunk, filtered)
   }

For the processing of VCFs with large headers in many chunks
(e.g. 1000genomes callsets), this can result in the paradox situation
that the filtered file ends up being significantly larger than the
original.

Best wishes
Julian

R version 3.2.1 (2015-06-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] VariantAnnotation_1.14.6 Rsamtools_1.20.4         Biostrings_2.36.1       
[4] XVector_0.8.0            GenomicRanges_1.20.5     GenomeInfoDb_1.4.1      
[7] IRanges_2.2.5            S4Vectors_0.6.3          BiocGenerics_0.14.0     

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.30.1    zlibbioc_1.14.0         GenomicAlignments_1.4.1
 [4] BiocParallel_1.2.9      BSgenome_1.36.3         tools_3.2.1            
 [7] Biobase_2.28.0          DBI_0.3.1               lambda.r_1.1.7         
[10] futile.logger_1.4.1     rtracklayer_1.28.6      futile.options_1.0.0   
[13] bitops_1.0-6            RCurl_1.95-4.7          biomaRt_2.24.0         
[16] RSQLite_1.0.0           GenomicFeatures_1.20.1  XML_3.98-1.3