[Bioc-devel] VCF Intersection Using readVcf Remarkably Slow

Vincent Carey stvjc at channing.harvard.edu
Wed Sep 28 11:55:09 CEST 2016


Dario's computer is faster than mine:

> system.time(commonMutations <- readVcf(anotherFile, "hg19", rowRanges(aSet)))
   user  system elapsed
426.271  57.296 483.766

The disk infrastructure is a determinant of throughput.  Most VCF queries
are decomposable and can be parallelized.  After chunking the indices of
aSet into 20 chunks (one way of building the chunk list ch is sketched
below), the timing becomes:
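
A minimal sketch of that chunking step, assuming a split()/cut() approach
(illustrative, not necessarily the exact code used):

library(parallel)           # mclapply
library(VariantAnnotation)  # readVcf, rowRanges

idx <- seq_along(rowRanges(aSet))                # indices of the query ranges
ch  <- split(idx, cut(idx, 20, labels = FALSE))  # 20 roughly equal-sized chunks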


> system.time(comm <- mclapply(ch, function(x)
+     readVcf(anotherFile, "hg19", rowRanges(aSet)[x]), mc.cores=20))
   user  system elapsed
628.307 322.830  51.303


As far as I can tell, the answers agree.  Is this a risky approach?  The
payload of readVcf can also be reduced with a ScanVcfParam, which might
improve throughput.
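
For example, a minimal sketch of such a ScanVcfParam (the field choices are
illustrative; keeping only "GT" assumes no other genotype fields are needed):

param <- ScanVcfParam(info = NA, geno = "GT",    # drop INFO fields, keep only GT
                      which = rowRanges(aSet))   # restrict to the query ranges
slimCommon <- readVcf(anotherFile, "hg19", param = param)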

On Tue, Sep 27, 2016 at 6:23 PM, Michael Lawrence <lawrence.michael at gene.com> wrote:

> I think the basic problem is that each range requires a separate query
> through tabix. BAM and tabix are designed to be fast for single
> queries, like what a genome browser might generate, but not for
> querying thousands of regions at once. At least that's the way it
> seems to me. The index is only at the block level, because the data
> are compressed.  In principle, smarter caching could speed this up,
> but that belongs at the samtools level.
>
> To make many queries, it pays to load the data into memory, or a more
> efficient on-disk representation (HDF5, GDS, ...), first.
>
> Michael
>
> On Tue, Sep 27, 2016 at 3:00 PM, Dario Strbenac <dstr7320 at uni.sydney.edu.au> wrote:
> > Good day,
> >
> > file <- system.file("extdata", "chr22.vcf.gz", package = "VariantAnnotation")
> > anotherFile <- system.file("extdata", "hapmap_exome_chr22.vcf.gz", package = "VariantAnnotation")
> > aSet <- readVcf(file, "hg19")
> > system.time(commonMutations <- readVcf(anotherFile, "hg19", rowRanges(aSet)))
> >    user  system elapsed
> > 209.120  16.628 226.083
> >
> > Reading in the Exome chromosome 22 VCF and intersecting it with the other file in the data directory takes almost 4 minutes.
> >
> > However, reading in the whole file is much faster.
> >
> >> system.time(anotherSet <- readVcf(anotherFile, "hg19"))
> >    user  system elapsed
> >   0.376   0.016   0.392
> >
> > and doing the intersection manually takes a fraction of a second
> >
> >> system.time(fastCommonMutations <- intersect(rowRanges(aSet), rowRanges(anotherSet)))
> >    user  system elapsed
> >   0.128   0.000   0.129
> >
> > This comparison ignores the finer details such as the identities of the alleles, but does it have to be so slow? My real use case is intersecting dozens of VCF files of cancer samples with the ExAC consortium's VCF file, which is 4 GB when compressed. I can't imagine how long that would take.
> >
> > Can the code of readVcf be optimised?
> >
> > --------------------------------------
> > Dario Strbenac
> > University of Sydney
> > Camperdown NSW 2050
> > Australia
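
Regarding Michael's point about loading the data into memory or converting to
a more efficient on-disk representation first: a minimal sketch using
SeqArray's GDS format (the output file name is made up, and the exact calls
are illustrative, not something benchmarked here):

library(SeqArray)

# One-time conversion of the VCF to GDS; afterwards range queries are cheap.
seqVCF2GDS(anotherFile, "hapmap_exome_chr22.gds")

gds <- seqOpen("hapmap_exome_chr22.gds")
seqSetFilter(gds, rowRanges(aSet))        # keep variants overlapping the query ranges
commonPos <- seqGetData(gds, "position")  # e.g. positions of the selected variants
seqClose(gds)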



