[Bioc-devel] VCF Intersection Using readVcf Remarkably Slow

Dario Strbenac dstr7320 at uni.sydney.edu.au
Wed Sep 28 00:00:14 CEST 2016


Good day,

> library(VariantAnnotation)
> file <- system.file("extdata", "chr22.vcf.gz", package = "VariantAnnotation")
> anotherFile <- system.file("extdata", "hapmap_exome_chr22.vcf.gz", package = "VariantAnnotation")
> aSet <- readVcf(file, "hg19")
> system.time(commonMutations <- readVcf(anotherFile, "hg19", rowRanges(aSet)))
   user  system elapsed 
209.120  16.628 226.083 

Reading the exome chromosome 22 VCF while restricting it to the ranges of the other file in the data directory takes almost 4 minutes (226 seconds elapsed).

However, reading in the whole file is much faster.

> system.time(anotherSet <- readVcf(anotherFile, "hg19"))
   user  system elapsed 
  0.376   0.016   0.392 

Doing the intersection manually then takes a fraction of a second:

> system.time(fastCommonMutations <- intersect(rowRanges(aSet), rowRanges(anotherSet)))
   user  system elapsed 
  0.128   0.000   0.129
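
Note that intersect() on the row ranges returns only genomic ranges, not VCF records. As a rough sketch of how the manual route could be extended to recover the matching records themselves, something like the following might work. It assumes both files use the same seqlevels style; the names hits and commonRecords are only illustrative.

```r
library(VariantAnnotation)

file <- system.file("extdata", "chr22.vcf.gz", package = "VariantAnnotation")
anotherFile <- system.file("extdata", "hapmap_exome_chr22.vcf.gz", package = "VariantAnnotation")

# Read both files completely, which is the fast path timed above.
aSet <- readVcf(file, "hg19")
anotherSet <- readVcf(anotherFile, "hg19")

# Find which records of anotherSet overlap any position in aSet.
# Assumption: both objects share the same seqlevels style; otherwise no hits are found.
hits <- findOverlaps(rowRanges(anotherSet), rowRanges(aSet))

# Subset the VCF object by row so that the result is still a VCF,
# keeping REF, ALT and genotype data for the overlapping records.
commonRecords <- anotherSet[unique(queryHits(hits)), ]
```

This matches on position only, so alleles would still need to be compared afterwards.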

This comparison ignores finer details such as the identities of the alleles, but does it have to be so slow? My real use case is intersecting dozens of VCF files of cancer samples with the ExAC consortium's VCF file, which is 4 GB when compressed. I can't imagine how long that would take.
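
For a file too large to read in one go, the only workaround I can think of is to stream it in chunks and filter each chunk in memory. Below is a minimal sketch of that idea using the small example files. The chunk size of 5000 is arbitrary, the names targets, tabix and matched are illustrative, and I have not tried this on the ExAC file.

```r
library(VariantAnnotation)

file <- system.file("extdata", "chr22.vcf.gz", package = "VariantAnnotation")
anotherFile <- system.file("extdata", "hapmap_exome_chr22.vcf.gz", package = "VariantAnnotation")

# Positions of interest, taken from the smaller file.
targets <- rowRanges(readVcf(file, "hg19"))

# Open the second file so that each readVcf() call yields the next chunk.
# yieldSize = 5000 is an arbitrary illustrative value.
tabix <- Rsamtools::TabixFile(anotherFile, yieldSize = 5000)
open(tabix)

matched <- list()
while (nrow(chunk <- readVcf(tabix, "hg19")) > 0) {
    # Keep only the positions in this chunk that overlap a target range.
    keep <- overlapsAny(rowRanges(chunk), targets)
    matched[[length(matched) + 1]] <- rowRanges(chunk)[keep]
}
close(tabix)
```

Each element of matched holds the overlapping positions from one chunk, so memory use is bounded by the chunk size rather than the file size.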

Can the code of readVcf be optimised?

--------------------------------------
Dario Strbenac
University of Sydney
Camperdown NSW 2050
Australia

