[BioC] VariantAnnotation: Performance and memory issues in readVcf
Ulrich Bodenhofer
bodenhofer at bioinf.jku.at
Tue May 14 17:59:55 CEST 2013
Hi,
I am currently working on a package that takes input from VCF files
(possibly large ones), and I wanted to use readVcf() from the
VariantAnnotation package for reading the data. However, I have run into
severe performance and memory issues. The following test report is based
on a bgzipped and tabix-indexed VCF file with 100 samples/columns and
approx. 1.9 million variants/rows. I have previously failed to load
data of this size in one go using readVcf(), so I read the data in
chunks (see the sketch below). I always had the impression that this was
quite slow, but now I compared it with tabix on the command line and
with scanTabix() from the Rsamtools package, and the results were
stunning. Here are some code chunks. As you can see, I am trying to read
a 1,400,001 bp region from chr1. This region actually contains 8,757
variants/rows:
> tf <- TabixFile("testFile100.vcf.gz")
> gr <- GRanges(seqnames="1", ranges=IRanges(start=100000, end=1500000))
>
> system.time(res1 <- readVcf(tf, "hg19", param=gr))
   user  system elapsed
540.080   0.624 541.906
There were 50 or more warnings (use warnings() to see the first 50)
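For context, the chunk-wise reading mentioned above looks roughly like
this (a minimal sketch, not my exact code; the window size and the
chromosome length are illustrative):

    chunkSize <- 1500000
    starts <- seq(1, 249250621, by=chunkSize)  # approx. chr1 length in hg19
    for (s in starts) {
        gr <- GRanges(seqnames="1",
                      ranges=IRanges(start=s, end=s + chunkSize - 1))
        vcf <- readVcf(tf, "hg19", param=gr)   # one window per iteration
        ## ... process the chunk in 'vcf' ...
    }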
The scanTabix() function from the Rsamtools package gives the following
result:
> system.time(res2 <- scanTabix(tf, param=gr))
   user  system elapsed
  0.170   0.002   0.171
The tabix command line tool takes approximately the same amount of time.
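For completeness, the command-line comparison was along these lines
(shown here wrapped in system() to keep everything in R; the redirection
discards the output so that only reading and decoding are timed):

    ## tabix region syntax is chr:start-end
    system.time(system("tabix testFile100.vcf.gz 1:100000-1500000 > /dev/null"))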
I admit that scanTabix() just reads the data and does no parsing or any
other processing, so I dug deeper and saw that readVcf() calls
scanTabix() inside the function .vcf_scan(). Interestingly, all the
parsing is done inside scanTabix() itself: a native parsing routine is
supplied through the undocumented tbxsym argument. So I tried the same
approach:
> maps <- VariantAnnotation:::.vcf_scan_header_maps(tf, fixed=character(),
+     info=NA, geno="GT")
Warning message:
In .bcfHeaderAsSimpleList(header) :
duplicate keys in header will be forced to unique rownames
> tbxstate <- maps[c("samples", "fmap", "imap", "gmap")]
> tbxsym <- getNativeSymbolInfo(".tabix_as_vcf", "VariantAnnotation")
>
> system.time(res3 <- scanTabix(tf, param=gr, tbxsym=tbxsym,
+     tbxstate=tbxstate))
   user  system elapsed
  0.511   0.006   0.517
So even when I plug in the same VCF parsing that readVcf() uses, the
total time is still in the range of half a second, approx. 1,000 times
faster than calling readVcf(). Where is all the performance lost? I can
hardly imagine that 539.5 of 540 seconds are spent on data
transformations.
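In case it helps with the diagnosis, the time could be broken down with
the standard R profiler (a minimal sketch):

    Rprof("readVcf.prof")
    res1 <- readVcf(tf, "hg19", param=gr)
    Rprof(NULL)
    ## functions with the largest self time come out on top
    head(summaryRprof("readVcf.prof")$by.self)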
A similar situation can be observed in terms of memory consumption:
after loading the VariantAnnotation package (and all packages it depends
upon), my R process occupies about 185MB of main memory. Reading and
parsing the data with scanTabix() increases the memory consumption by
about 40MB, whereas calling readVcf() increases it by more than 400MB.
I do not think that such an amount of memory is needed to accommodate
the output object res3; object.size() says it is about 8MB, although I
know that these figures need not be accurate.
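The figures above come from watching the R process from the outside;
within R, one can get a similar picture like this (a sketch;
gc(reset=TRUE) resets the "max used" statistics, so the subsequent gc()
reports the peak allocation during the call):

    gc(reset=TRUE)
    res3 <- scanTabix(tf, param=gr, tbxsym=tbxsym, tbxstate=tbxstate)
    gc()                                  # 'max used' = peak during the call
    print(object.size(res3), units="MB")  # nominal size of the result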
Actually, I do not need the whole flexibility of readVcf(). If
necessary, I would be satisfied with a workaround like the one based on
scanTabix() above. However, I do not like the idea of using undocumented
internals of other packages in my package; if possible, I would prefer
to rely on readVcf(). So I would be extremely grateful if someone could
either explain the situation or, in case there are bugs, fix them.
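For reference, the workaround I would otherwise have to ship would look
roughly like this (a sketch assembled from the calls above; the name
readVcfGT is purely illustrative):

    ## hypothetical wrapper around the scanTabix()-based workaround; it
    ## relies on undocumented VariantAnnotation internals, which is
    ## exactly what I would like to avoid
    readVcfGT <- function(tf, gr) {
        maps <- VariantAnnotation:::.vcf_scan_header_maps(tf,
            fixed=character(), info=NA, geno="GT")
        tbxstate <- maps[c("samples", "fmap", "imap", "gmap")]
        tbxsym <- getNativeSymbolInfo(".tabix_as_vcf", "VariantAnnotation")
        scanTabix(tf, param=gr, tbxsym=tbxsym, tbxstate=tbxstate)
    }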
Thanks and best regards,
Ulrich
------------------------------------------------------------------------
*Dr. Ulrich Bodenhofer*
Associate Professor
Institute of Bioinformatics
*Johannes Kepler University*
Altenberger Str. 69
4040 Linz, Austria
Tel. +43 732 2468 4526
Fax +43 732 2468 4539
bodenhofer at bioinf.jku.at
http://www.bioinf.jku.at/