[Bioc-devel] Help understanding an R performance issue
Bernat Gel
bgel at igtp.cat
Thu Jun 29 19:43:11 CEST 2017
Hi all,
This is not strictly a Bioconductor question, but I hope some of the
experts here can help me understand what's going on with a performance
issue I've found working on a package.
It has to do with selecting elements from a named vector.
If we have a vector with the names of the chromosomes and their order:
chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y")))
chrs
 chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9 chr10 chr11 chr12
    1     2     3     4     5     6     7     8     9    10    11    12
chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22  chrX  chrY
   13    14    15    16    17    18    19    20    21    22    23    24
And we have a second vector of chromosome names (in this case, the
chromosomes of the probes on a SNP array), and we want to use it to
select from the first one by name:
cc <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19", 14726),
        rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252),
        rep("chrX", 17498), rep("chrY", 1296))
print(system.time(replicate(10, chrs[cc])))
   user  system elapsed
  0.136   0.004   0.141
It's fast.
However, if I get the names of the last two chromosomes wrong (chr23
and chr24 instead of chrX and chrY):
cc2 <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19", 14726),
         rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252),
         rep("chr23", 17498), rep("chr24", 1296))
print(system.time(replicate(10, chrs[cc2])))
   user  system elapsed
144.672   0.012 144.675
It is MUCH slower (about 1000x).
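A quick way to confirm which of the requested names fail to match is to compare them against names(chrs). This is only a minimal sketch; `unmatched` and `n_unmatched` are illustrative variable names, not part of the package:

```r
chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y")))
cc2 <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19", 14726),
         rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252),
         rep("chr23", 17498), rep("chr24", 1296))

# Distinct names requested in cc2 that are not present in names(chrs)
unmatched <- setdiff(unique(cc2), names(chrs))
print(unmatched)

# Number of elements of cc2 whose name has no match in names(chrs)
n_unmatched <- sum(!(cc2 %in% names(chrs)))
print(n_unmatched)
```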
BUT, if I shuffle the elements in the second vector:
cc3 <- sample(cc2, length(cc), replace = FALSE)
print(system.time(replicate(10, chrs[cc3])))
   user  system elapsed
  0.096   0.004   0.102
It's fast again!!!
The elapsed time seems to grow with the number of elements BEFORE the
failing names:
cc4 <- c(rep("chr22", 10252), rep("chr23", 17498), rep("chr24", 1296))
print(system.time(replicate(10, chrs[cc4])))
   user  system elapsed
 17.332   0.004  17.336
cc5 <- c(rep("chr23", 17498), rep("chr24", 1296))
print(system.time(replicate(10, chrs[cc5])))
   user  system elapsed
  1.872   0.000   1.901
So my guess is that it might come from moving the vector around in
memory for each "failed" selection, or something similar...
Is that correct? Is there anything I'm missing?
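For comparison, the same lookup can be written as an explicit match() on the names followed by integer indexing. This is only a sketch of an alternative formulation, not an explanation of the slowdown; `idx` and `res` are illustrative names, and I am assuming that match() is not sensitive to where the unmatched names sit in the vector:

```r
chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y")))
cc2 <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19", 14726),
         rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252),
         rep("chr23", 17498), rep("chr24", 1296))

# Positions of each requested name in names(chrs); NA where there is no match
idx <- match(cc2, names(chrs))

# Integer indexing: NA positions give NA values, as with the unmatched names
res <- chrs[idx]

# Time the same 10 repetitions used for the character-index version
print(system.time(replicate(10, chrs[match(cc2, names(chrs))])))
```

The values in res are the ones chrs[cc2] would return (NA for chr23 and chr24); only the way the positions are found differs.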
Thanks a lot
Bernat
--
*Bernat Gel Moreno*
Bioinformatician
Hereditary Cancer Program
Program of Predictive and Personalized Medicine of Cancer (PMPPC)
Germans Trias i Pujol Research Institute (IGTP)
Campus Can Ruti
Carretera de Can Ruti, Camí de les Escoles s/n
08916 Badalona, Barcelona, Spain
Tel: (+34) 93 554 3068
Fax: (+34) 93 497 8654
bgel at igtp.cat
www.germanstrias.org <http://www.germanstrias.org/>