[Bioc-devel] VCF Intersection Using readVcf Remarkably Slow

Wed Sep 28 15:05:26 CEST 2016

On 09/27/2016 06:00 PM, Dario Strbenac wrote:
> Good day,
>
> file <- system.file("extdata", "chr22.vcf.gz", package = "VariantAnnotation")
> anotherFile <- system.file("extdata", "hapmap_exome_chr22.vcf.gz", package = "VariantAnnotation")
> aSet <- readVcf(file, "hg19")
> system.time(commonMutations <- readVcf(anotherFile, "hg19", rowRanges(aSet)))
>    user  system elapsed
> 209.120  16.628 226.083
>
> Reading in the Exome chromosome 22 VCF and intersecting it with the other file in the data directory takes almost 4 minutes.
>
> However, reading in the whole file is much faster.
>
>> system.time(anotherSet <- readVcf(anotherFile, "hg19"))
>    user  system elapsed
>   0.376   0.016   0.392
>
> and doing the intersection manually takes a fraction of a second
>
>> system.time(fastCommonMutations <- intersect(rowRanges(aSet), rowRanges(anotherSet)))
>    user  system elapsed
>   0.128   0.000   0.129
>
> This comparison ignores the finer details such as the identities of the alleles, but does it have to be so slow ? My real use case is intersecting dozens of VCF files of cancer samples with the ExAC consortium's VCF file that is 4 GB in size when compressed. I can't imagine how long that would take.
>
> Can the code of readVcf be optimised ?

iterate through the file using yieldSize to manage memory, e.g.,

rngs <- rowRanges(aSet)
genome(rngs) <- "b37"

GenomicFiles::reduceByYield(
     VcfFile(anotherFile, yieldSize=10000),
     readVcf,
     function(YIELD) subsetByOverlaps(YIELD, rngs),
     c)

>
> --------------------------------------
> Dario Strbenac
> University of Sydney
> Camperdown NSW 2050
> Australia
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>


This email message may contain legally privileged and/or...{{dropped:2}}