[Bioc-devel] VCF Intersection Using readVcf Remarkably Slow
Vincent Carey
stvjc at channing.harvard.edu
Wed Sep 28 11:55:09 CEST 2016
Dario's computer is faster than mine
> system.time(commonMutations <- readVcf(anotherFile, "hg19",
rowRanges(aSet)))
user system elapsed
426.271 57.296 483.766
The disk infrastructure is a determinant of throughput. Most VCF queries
are decomposable and can be parallelized. After
chunking the indices of aSet to 20 chunks
> system.time(comm <- mclapply(ch, function(x) readVcf(anotherFile, "hg19",
rowRanges(aSet)[x]), mc.cores=20))
user system elapsed
628.307 322.830 51.303
As far as I can tell, the answers agree. Is this a risky approach? Also
the payload of readVcf can be reduced with a ScanVcfParam, that might
improve throughput.
On Tue, Sep 27, 2016 at 6:23 PM, Michael Lawrence <lawrence.michael at gene.com
> wrote:
> I think the basic problem is that each range requires a separate query
> through tabix. BAM and tabix are designed to be fast for single
> queries, like what a genome browser might generate, but not for
> querying thousands of regions at once. At least that's the way it
> seems to me. The index is only at the block level, because the data
> are compressed. In principle, smarter caching could speed this up,
> but that belongs at the samtools level.
>
> To make many queries, it pays to load the data into memory, or a more
> efficient on-disk representation (HDF5, GDS, ...), first.
>
> Michael
>
> On Tue, Sep 27, 2016 at 3:00 PM, Dario Strbenac
> <dstr7320 at uni.sydney.edu.au> wrote:
> > Good day,
> >
> > file <- system.file("extdata", "chr22.vcf.gz", package =
> "VariantAnnotation")
> > anotherFile <- system.file("extdata", "hapmap_exome_chr22.vcf.gz",
> package = "VariantAnnotation")
> > aSet <- readVcf(file, "hg19")
> > system.time(commonMutations <- readVcf(anotherFile, "hg19",
> rowRanges(aSet)))
> > user system elapsed
> > 209.120 16.628 226.083
> >
> > Reading in the Exome chromosome 22 VCF and intersecting it with the
> other file in the data directory takes almost 4 minutes.
> >
> > However, reading in the whole file is much faster.
> >
> >> system.time(anotherSet <- readVcf(anotherFile, "hg19"))
> > user system elapsed
> > 0.376 0.016 0.392
> >
> > and doing the intersection manually takes a fraction of a second
> >
> >> system.time(fastCommonMutations <- intersect(rowRanges(aSet),
> rowRanges(anotherSet)))
> > user system elapsed
> > 0.128 0.000 0.129
> >
> > This comparison ignores the finer details such as the identities of the
> alleles, but does it have to be so slow ? My real use case is intersecting
> dozens of VCF files of cancer samples with the ExAC consortium's VCF file
> that is 4 GB in size when compressed. I can't imagine how long that would
> take.
> >
> > Can the code of readVcf be optimised ?
> >
> > --------------------------------------
> > Dario Strbenac
> > University of Sydney
> > Camperdown NSW 2050
> > Australia
> > _______________________________________________
> > Bioc-devel at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
[[alternative HTML version deleted]]
More information about the Bioc-devel
mailing list