[BioC] VariantAnnotation: Performance and memory issues in readVcf
Ulrich Bodenhofer
bodenhofer at bioinf.jku.at
Wed May 15 08:56:00 CEST 2013
Sorry, I forgot to include some information about my versions of R and
the mentioned packages:
> sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-unknown-linux-gnu/x86_64 (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] VariantAnnotation_1.6.1 Rsamtools_1.12.0 Biostrings_2.28.0
[4] GenomicRanges_1.12.1 IRanges_1.18.0 BiocGenerics_0.6.0
loaded via a namespace (and not attached):
[1] AnnotationDbi_1.22.1 Biobase_2.20.0 biomaRt_2.16.0
[4] bitops_1.0-5 BSgenome_1.28.0 DBI_0.2-5
[7] GenomicFeatures_1.12.0 RCurl_1.95-4.1 RSQLite_0.11.2
[10] rtracklayer_1.20.0 stats4_3.0.0 tools_3.0.0
[13] XML_3.96-1.1 zlibbioc_1.6.0
In the meantime, I profiled the readVcf() test to figure out where most
of the time is spent:
> Rprof(memory.profiling=TRUE, gc.profiling=TRUE)
> system.time(res1 <- readVcf(tf, "hg19", param=gr))
user system elapsed
515.390 0.047 516.623
There were 50 or more warnings (use warnings() to see the first 50)
> Rprof(NULL)
> summaryRprof(memory="both")
$by.self
self.time self.pct total.time
total.pct mem.total
".makeMessage" 128.02 24.72 145.66 28.12
22067.9
"structure" 96.72 18.67 140.00 27.03
20838.4
"makeRestartList" 33.12 6.39 126.76 24.47
18921.9
"match" 31.92 6.16 32.76 6.33
4899.5
"<GC>" 24.40 4.71 24.40 4.71
332.0
"$<-" 21.74 4.20 23.22 4.48
3312.1
"doWithOneRestart" 19.94 3.85 383.52 74.05
57610.2
"$" 13.26 2.56 13.30 2.57
2013.4
"conditionMessage" 13.10 2.53 24.04 4.64
3645.3
"lapply" 12.78 2.47 505.10 97.52
76013.9
"warning" 12.04 2.32 282.56 54.56
42405.6
"FUN" 11.34 2.19 500.86 96.71
75547.9
"withRestarts" 10.80 2.09 478.60 92.41
71807.5
"%in%" 7.00 1.35 24.56 4.74
3698.8
"paste" 6.74 1.30 6.74 1.30
1010.4
"[[" 6.00 1.16 6.00 1.16
906.0
"<Anonymous>" 5.68 1.10 503.96 97.30
75073.3
"findRestart" 5.62 1.09 11.66 2.25
1760.5
"sprintf" 5.30 1.02 29.36 5.67
4458.9
"withOneRestart" 5.26 1.02 410.18 79.20
61603.1
"do.call" 5.22 1.01 13.12 2.53
1994.0
"docall" 5.10 0.98 23.28 4.49
3482.4
"makeRestart" 4.76 0.92 79.22 15.30
11743.5
"unlist" 4.12 0.80 6.66 1.29
810.2
"simpleWarning" 3.92 0.76 69.62 13.44
10428.5
"match.fun" 3.56 0.69 3.60 0.70
537.0
"invokeRestart" 3.54 0.68 17.62 3.40
2647.7
".signalSimpleWarning" 3.14 0.61 480.36 92.75
72073.0
"conditionMessage.condition" 2.94 0.57 10.86 2.10
1649.9
"isRestart" 2.36 0.46 2.38 0.46
349.8
"environment" 1.48 0.29 1.48 0.29
226.0
".Call" 1.12 0.22 1.14 0.22
63.2
"parent.frame" 1.04 0.20 1.08 0.21
158.8
"strsplit" 0.76 0.15 0.90 0.17
321.0
"as.character" 0.58 0.11 0.58 0.11
87.8
"mapply" 0.44 0.08 499.36 96.42
74815.4
"slot<-" 0.44 0.08 3.04 0.59 296.7
"deparse" 0.36 0.07 0.36
0.07 9.3
"fun" 0.30 0.06 0.30 0.06
42.8
"initialize" 0.20 0.04 9.64 1.86
1465.9
"endoapply" 0.20 0.04 4.16 0.80
602.5
"list" 0.20 0.04 0.26 0.05
194.8
".collapseLists" 0.18 0.03 506.08 97.71
75986.2
"loadMethod" 0.10 0.02 0.14 0.03
59.6
"standardGeneric" 0.08 0.02 516.52 99.73
77637.0
"==" 0.08 0.02 0.08 0.02
10.0
"withCallingHandlers" 0.06 0.01 498.58 96.27
74587.6
"array" 0.06 0.01 2.86 0.55
227.4
"factor" 0.06 0.01 0.06 0.01
20.9
"eval" 0.04 0.01 499.44 96.43
74842.5
"tryCatch" 0.04 0.01 3.66 0.71
190.5
".forceType" 0.04 0.01 2.02 0.39
120.3
"mode<-" 0.04 0.01 1.96 0.38 100.2
"validObject" 0.04 0.01 1.94 0.37
60.5
"assign" 0.04 0.01 0.22 0.04
187.2
"split.default" 0.04 0.01 0.10 0.02
27.7
"aperm.default" 0.04 0.01 0.06 0.01
10.0
"get" 0.04 0.01 0.04 0.01
11.1
"grep" 0.04 0.01 0.04
0.01 3.0
"ls" 0.04 0.01 0.04 0.01
12.7
"doTryCatch" 0.02 0.00 3.64 0.70
185.4
"tryCatchOne" 0.02 0.00 3.64 0.70
185.4
"as.numeric" 0.02 0.00 0.84 0.16
20.0
"length" 0.02 0.00 0.06 0.01
21.0
"elementMetadata" 0.02 0.00 0.04 0.01
22.8
"exists" 0.02 0.00 0.04 0.01
14.1
"getClass" 0.02 0.00 0.04 0.01
17.2
".getClassFromCache" 0.02 0.00 0.02 0.00
11.0
"is.na" 0.02 0.00 0.02 0.00
19.5
"matrix" 0.02 0.00 0.02
0.00 7.4
"objects" 0.02 0.00 0.02
0.00 7.2
"parent.env" 0.02 0.00 0.02
0.00 6.1
"possibleExtends" 0.02 0.00 0.02
0.00 9.4
"rownames<-" 0.02 0.00 0.02 0.00 31.1
"setNames" 0.02 0.00 0.02 0.00
27.0
$by.total
total.time total.pct mem.total
self.time self.pct
"system.time" 517.92 100.00 77637.0 0.00
0.00
"standardGeneric" 516.52 99.73 77637.0 0.08
0.02
".readVcf" 516.52 99.73 77637.0 0.00
0.00
"readVcf" 516.52 99.73 77637.0 0.00
0.00
".scanVcfToVCF" 516.52 99.73 77637.0 0.00
0.00
".collapseLists" 506.08 97.71 75986.2 0.18
0.03
"sapply" 505.58 97.62 75866.1 0.00
0.00
"scanVcf" 505.20 97.54 75848.8 0.00
0.00
"lapply" 505.10 97.52 76013.9
12.78 2.47
".vcf_scan" 504.20 97.35 75392.5 0.00
0.00
"<Anonymous>" 503.96 97.30 75073.3
5.68 1.10
".unpackVcf" 502.88 97.10 75278.6 0.00
0.00
"FUN" 500.86 96.71 75547.9
11.34 2.19
"Map" 499.46 96.44 74842.6 0.00
0.00
"eval" 499.44 96.43 74842.5 0.04
0.01
"mapply" 499.36 96.42 74815.4 0.44
0.08
".Method" 499.36 96.42 74815.4 0.00
0.00
".unpackVcfTag" 499.12 96.37 74795.4 0.00
0.00
"withCallingHandlers" 498.58 96.27 74587.6 0.06
0.01
".unpackVcfField" 498.58 96.27 74587.6 0.00
0.00
".signalSimpleWarning" 480.36 92.75 72073.0 3.14
0.61
"withRestarts" 478.60 92.41 71807.5
10.80 2.09
"withOneRestart" 410.18 79.20 61603.1 5.26
1.02
"doWithOneRestart" 383.52 74.05 57610.2
19.94 3.85
"warning" 282.56 54.56 42405.6
12.04 2.32
".makeMessage" 145.66 28.12 22067.9
128.02 24.72
"structure" 140.00 27.03 20838.4 96.72
18.67
"makeRestartList" 126.76 24.47 18921.9
33.12 6.39
"makeRestart" 79.22 15.30 11743.5 4.76
0.92
"simpleWarning" 69.62 13.44 10428.5 3.92
0.76
"match" 32.76 6.33 4899.5
31.92 6.16
"sprintf" 29.36 5.67 4458.9 5.30
1.02
"%in%" 24.56 4.74 3698.8 7.00
1.35
"<GC>" 24.40 4.71 332.0
24.40 4.71
"conditionMessage" 24.04 4.64 3645.3
13.10 2.53
"docall" 23.28 4.49 3482.4 5.10
0.98
"$<-" 23.22 4.48 3312.1
21.74 4.20
"invokeRestart" 17.62 3.40 2647.7 3.54
0.68
"$" 13.30 2.57 2013.4
13.26 2.56
"do.call" 13.12 2.53 1994.0 5.22
1.01
"findRestart" 11.66 2.25 1760.5 5.62
1.09
"conditionMessage.condition" 10.86 2.10 1649.9 2.94
0.57
"initialize" 9.64 1.86 1465.9 0.20
0.04
"new" 9.64 1.86 1465.9 0.00
0.00
"VCF" 7.70 1.49 1443.5 0.00
0.00
"SummarizedExperiment" 7.66 1.48 1393.2 0.00
0.00
".local" 6.82 1.32 944.4 0.00
0.00
"paste" 6.74 1.30 1010.4 6.74
1.30
"unlist" 6.66 1.29 810.2 4.12
0.80
"[[" 6.00 1.16 906.0 6.00
1.16
"endoapply" 4.16 0.80 602.5 0.20
0.04
"tryCatch" 3.66 0.71 190.5 0.04
0.01
"doTryCatch" 3.64 0.70 185.4 0.02
0.00
"tryCatchOne" 3.64 0.70 185.4 0.02
0.00
"tryCatchList" 3.64 0.70 185.4 0.00
0.00
"match.fun" 3.60 0.70 537.0 3.56
0.69
"slot<-" 3.04 0.59 296.7
0.44 0.08
"array" 2.86 0.55 227.4 0.06
0.01
"try" 2.40 0.46 99.3 0.00
0.00
"isRestart" 2.38 0.46 349.8 2.36
0.46
".formatInfo" 2.36 0.46 155.4 0.00
0.00
".forceType" 2.02 0.39 120.3 0.04
0.01
"mode<-" 1.96 0.38 100.2
0.04 0.01
"validObject" 1.94 0.37 60.5 0.04
0.01
"new2" 1.94 0.37 49.9 0.00
0.00
".formatList" 1.88 0.36 43.5 0.00
0.00
".coerceToList" 1.86 0.36 35.5 0.00
0.00
"NumericList" 1.82 0.35 18.2 0.00
0.00
"PartitioningByEnd" 1.80 0.35 7.2 0.00
0.00
"environment" 1.48 0.29 226.0 1.48
0.29
"gc" 1.40 0.27 0.0 0.00
0.00
".Call" 1.14 0.22 63.2 1.12
0.22
"scanTabix" 1.14 0.22 80.2 0.00
0.00
".tabix_scan" 1.14 0.22 80.2 0.00
0.00
"parent.frame" 1.08 0.21 158.8 1.04
0.20
"strsplit" 0.90 0.17 321.0 0.76
0.15
"SimpleList" 0.86 0.17 468.0 0.00
0.00
"as.numeric" 0.84 0.16 20.0 0.02
0.00
"as.character" 0.58 0.11 87.8 0.58
0.11
"as" 0.58 0.11 86.1 0.00
0.00
"asMethod" 0.56 0.11 79.4 0.00
0.00
"DataFrame" 0.54 0.10 91.6 0.00
0.00
"deparse" 0.36 0.07 9.3 0.36
0.07
"scanBcfHeader" 0.34 0.07 47.3 0.00
0.00
"scanVcfHeader" 0.34 0.07 47.3 0.00
0.00
".bcfHeaderAsSimpleList" 0.32 0.06 47.2 0.00
0.00
"fun" 0.30 0.06 42.8 0.30
0.06
"list" 0.26 0.05 194.8 0.20
0.04
"assign" 0.22 0.04 187.2 0.04
0.01
"envRefSetField" 0.18 0.03 168.6 0.00
0.00
"tapply" 0.16 0.03 39.3 0.00
0.00
".vcf_scan_header_maps" 0.16 0.03 23.6 0.00
0.00
"loadMethod" 0.14 0.03 59.6 0.10
0.02
".unpackVcfInfo" 0.14 0.03 59.2 0.00
0.00
"split" 0.12 0.02 33.7 0.00
0.00
"which" 0.12 0.02 24.1 0.00
0.00
"split.default" 0.10 0.02 27.7 0.04
0.01
"==" 0.08 0.02 10.0 0.08
0.02
"as.factor" 0.08 0.02 33.7 0.00
0.00
"factor" 0.06 0.01 20.9 0.06
0.01
"aperm.default" 0.06 0.01 10.0 0.04
0.01
"length" 0.06 0.01 21.0 0.02
0.00
"aperm" 0.06 0.01 10.0 0.00
0.00
"as.list" 0.06 0.01 18.0 0.00
0.00
"get" 0.04 0.01 11.1 0.04
0.01
"grep" 0.04 0.01 3.0 0.04
0.01
"ls" 0.04 0.01 12.7 0.04
0.01
"elementMetadata" 0.04 0.01 22.8 0.02
0.00
"exists" 0.04 0.01 14.1 0.02
0.00
"getClass" 0.04 0.01 17.2 0.02
0.00
"anyStrings" 0.04 0.01 22.8 0.00
0.00
".classEnv" 0.04 0.01 12.7 0.00
0.00
"getClassDef" 0.04 0.01 14.1 0.00
0.00
"IntegerList" 0.04 0.01 17.3 0.00
0.00
"is" 0.04 0.01 17.4 0.00
0.00
"listClassName" 0.04 0.01 14.1 0.00
0.00
"loadedNamespaces" 0.04 0.01 12.7 0.00
0.00
"mcols" 0.04 0.01 22.8 0.00
0.00
"ncol" 0.04 0.01 8.2 0.00
0.00
"relist" 0.04 0.01 13.4 0.00
0.00
".requirePackage" 0.04 0.01 12.7 0.00
0.00
"seqnames" 0.04 0.01 28.6 0.00
0.00
"validityMethod" 0.04 0.01 22.8 0.00
0.00
".getClassFromCache" 0.02 0.00 11.0 0.02
0.00
"is.na" 0.02 0.00 19.5 0.02
0.00
"matrix" 0.02 0.00 7.4 0.02
0.00
"objects" 0.02 0.00 7.2 0.02
0.00
"parent.env" 0.02 0.00 6.1 0.02
0.00
"possibleExtends" 0.02 0.00 9.4 0.02
0.00
"rownames<-" 0.02 0.00 31.1 0.02
0.00
"setNames" 0.02 0.00 27.0 0.02
0.00
"as.vector" 0.02 0.00 5.7 0.00
0.00
"CharacterList" 0.02 0.00 8.0 0.00
0.00
"checkAtAssignment" 0.02 0.00 11.0 0.00
0.00
"close" 0.02 0.00 10.0 0.00
0.00
"close.TabixFile" 0.02 0.00 10.0 0.00
0.00
"coercerToClass" 0.02 0.00 9.9 0.00
0.00
"colnames" 0.02 0.00 19.1 0.00
0.00
"dim" 0.02 0.00 19.1 0.00
0.00
"dimnames<-" 0.02 0.00 31.1 0.00
0.00
"elementLengths" 0.02 0.00 6.5 0.00
0.00
"elementMetadata<-" 0.02 0.00 19.1 0.00
0.00
".findInheritedMethods" 0.02 0.00 7.2 0.00
0.00
".findMethodInTable" 0.02 0.00 9.4 0.00
0.00
".formatALT" 0.02 0.00 5.2 0.00
0.00
"getDanglingSeqlevels" 0.02 0.00 15.7 0.00
0.00
"is.factor" 0.02 0.00 12.9 0.00
0.00
"mcols<-" 0.02 0.00 19.1 0.00
0.00
"names<-" 0.02 0.00 11.0 0.00
0.00
"newCompressedList0" 0.02 0.00 7.4 0.00
0.00
"newGRanges" 0.02 0.00 3.7 0.00
0.00
"nrow" 0.02 0.00 6.6 0.00
0.00
"PartitioningByWidth" 0.02 0.00 5.2 0.00
0.00
".quickCoerceSelect" 0.02 0.00 9.4 0.00
0.00
"rowData" 0.02 0.00 19.1 0.00
0.00
"ScanBcfParam" 0.02 0.00 6.1 0.00
0.00
"ScanVcfParam" 0.02 0.00 6.1 0.00
0.00
"selectMethod" 0.02 0.00 7.2 0.00
0.00
"seqinfo<-" 0.02 0.00 15.7 0.00
0.00
"seqlevels" 0.02 0.00 15.7 0.00
0.00
"splitAsList" 0.02 0.00 6.1 0.00
0.00
".splitAsList_by_Rle" 0.02 0.00 6.1 0.00
0.00
"splitAsListReturnedClass" 0.02 0.00 6.1 0.00
0.00
".toDNAStringSetList" 0.02 0.00 5.2 0.00
0.00
"topenv" 0.02 0.00 6.1 0.00
0.00
"unique" 0.02 0.00 0.1 0.00
0.00
"valid.func" 0.02 0.00 3.7 0.00
0.00
".valid.GenomicRanges.mcols" 0.02 0.00 19.1 0.00
0.00
".valid.VCF.fixed" 0.02 0.00 19.1 0.00
0.00
".valid.Vector.mcols" 0.02 0.00 3.7 0.00
0.00
$sample.interval
[1] 0.02
$sampling.time
[1] 517.92
I hope this helps to clarify my questions.
Thanks to everybody in advance,
Ulrich
On 05/14/2013 05:59 PM, Ulrich Bodenhofer wrote:
> Hi,
>
> I am currently working on a package that takes input from VCF files
> (possibly large ones) and I wanted to use readVcf() from the
> VariantAnnotation package for reading the data. However, I have run
> into severe performance and memory issues. The following test report
> is based on a bgzipped and tabix-indexed VCF file with 100
> samples/columns and approx 1,9 millions of variants/rows. I have
> previously failed to load data of this size entirely using readVcf(),
> so I read the data in chunks. I always had the impression that this
> was quite slow, but now I compared it with tabix on the command line
> and scanTabix() from the Rsamtools packages and the results were
> stunning. Here are some code chunks. As you can see, I am trying to
> read a 1,400,001bp region from chr1. This region actually contains
> 8,757 variants/rows:
>
> > tf <- TabixFile("testFile100.vcf.gz")
> > gr <- GRanges(seqnames="1", ranges=IRanges(start=100000,
> end=1500000))
> >
> > system.time(res1 <- readVcf(tf, "hg19", param=gr))
> user system elapsed
> 540.080 0.624 541.906
> There were 50 or more warnings (use warnings() to see the first 50)
>
> The scanTabix() function from the Rsamtools package gives the
> following result:
>
> > system.time(res2 <- scanTabix(tf, param=gr))
> user system elapsed
> 0.170 0.002 0.171
>
> The tabix command line tool takes approximately the same amount of
> time. I admit that scanTabix() just reads the data and does no parsing
> or any other processing, so I digged deeper and saw that readVcf()
> calls scanTabix() inside the function .vcf_scan(). Interestingly, this
> is done in a way that all the parsing is done inside the scanTabix()
> function by supplying a function that does the parsing through the
> undocumented tbxsym argument. So I tried the same approach:
>
> > maps <- VariantAnnotation:::.vcf_scan_header_maps(tf,
> fixed=character(), info=NA,
> + geno="GT")
> Warning message:
> In .bcfHeaderAsSimpleList(header) :
> duplicate keys in header will be forced to unique rownames
> > tbxstate <- maps[c("samples", "fmap", "imap", "gmap")]
> > tbxsym <- getNativeSymbolInfo(".tabix_as_vcf", "VariantAnnotation")
> >
> > system.time(res3 <- scanTabix(tf, param=gr, tbxsym=tbxsym,
> tbxstate=tbxstate))
>>
>
>
> --
> ------------------------------------------------------------------------
> *Dr. Ulrich Bodenhofer*
> Associate Professor
> Institute of Bioinformatics
>
> *Johannes Kepler University*
> Altenberger Str. 69
> 4040 Linz, Austria
>
> Tel. +43 732 2468 4526
> Fax +43 732 2468 4539
> bodenhofer at bioinf.jku.at <mailto:bodenhofer at bioinf.jku.at>
> http://www.bioinf.jku.at/ <http://www.bioinf.jku.at>
> user system elapsed
> 0.511 0.006 0.517
>
> So, even if I include the same way of parsing VCF data as readVcf(),
> the total time is still in the range of half a second, which is still
> approx. 1000 times faster than calling readVcf(). So where is all the
> performance lost? I can hardly imagine that 539.5 of 540 seconds are
> spent on data transformations.
>
> A similar situation can be observed in terms of memory consumption:
> after having loaded the VariantAnnotation package (and all packages it
> depends upon), my R process occupies about 185MB main memory. Reading
> and parsing the data with scanTabix() increases the memory consumption
> by about 40MB, whereas calling readVcf() increases the memory
> consumption by more than 400MB . I do not think that such an amount of
> memory is needed to accommodate the output object res3; object.size()
> says it's about 8MB, but I know that these figure need not be accurate.
>
> Actually, I do not need the whole flexibility of readVcf(). If
> necessary, I would be satisfied with a workaround like the one based
> on scanTabix() above. However, I do not like the idea too much to use
> undocumented internals of other packages in my package. If possible, I
> would rather prefer to rely on readVcf(). So, I would be extremely
> grateful if someone could either explain the situation or, in case
> there are bugs, fix them.
>
> Thanks and best regards,
> Ulrich
>
>
> ------------------------------------------------------------------------
> *Dr. Ulrich Bodenhofer*
> Associate Professor
> Institute of Bioinformatics
>
> *Johannes Kepler University*
> Altenberger Str. 69
> 4040 Linz, Austria
>
> Tel. +43 732 2468 4526
> Fax +43 732 2468 4539
> bodenhofer at bioinf.jku.at <mailto:bodenhofer at bioinf.jku.at>
> http://www.bioinf.jku.at/ <http://www.bioinf.jku.at>
More information about the Bioconductor
mailing list