[Bioc-devel] Help understanding an R performance issue
Michael Lawrence
lawrence.michael at gene.com
Thu Jun 29 20:48:30 CEST 2017
Preliminary analysis suggests that this is due to hash misses. When
that happens, R ends up doing costly string comparisons that are on
the order of n^2 where 'n' is the length of the subscript. Looking
into it.
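
In the meantime, one possible workaround is to resolve the names to integer positions once with match() and then subscript by position, so the character-subscript path is not used at all. This is a minimal sketch, assuming the slowdown is confined to character subscripting as the preliminary analysis suggests. It has not been timed against the affected R version. The subscript is a scaled-down stand-in for cc2, and the names idx, by_name and by_pos are only illustrative.

```r
# Named vector from the original report.
chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y")))

# Scaled-down stand-in for cc2: a matching name followed by two
# names ("chr23", "chr24") that do not exist in names(chrs).
cc2 <- c(rep("chr22", 1000), rep("chr23", 1000), rep("chr24", 100))

# Direct character subscripting: the path reported as slow.
by_name <- chrs[cc2]

# Workaround sketch: look up positions once, then subscript by
# integer position. Names with no match become NA positions.
idx <- match(cc2, names(chrs))
by_pos <- chrs[idx]

# Both approaches select the same values (NA for the unmatched names).
stopifnot(identical(unname(by_name), unname(by_pos)))
```

If unmatched names should be treated as an error rather than silently returning NA, checking anyNA(idx) before subscripting would catch a typo such as "chr23" for "chrX".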
On Thu, Jun 29, 2017 at 10:43 AM, Bernat Gel <bgel at igtp.cat> wrote:
> Hi all,
>
> This is not strictly a Bioconductor question, but I hope some of the experts
> here can help me understand what's going on with a performance issue I've
> found working on a package.
>
> It has to do with selecting elements from a named vector.
>
> Suppose we have a vector with the names of the chromosomes and their order:
>
> chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y")))
> chrs
>
>  chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9 chr10 chr11 chr12 chr13
>     1     2     3     4     5     6     7     8     9    10    11    12    13
> chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22  chrX  chrY
>    14    15    16    17    18    19    20    21    22    23    24
>
> And we have a second vector of chromosomes (in this case, the chromosomes
> from SNP-array probes), which we want to use to select from the first one
> by name:
>
> cc <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19", 14726),
> rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252),
> rep("chrX", 17498), rep("chrY", 1296))
> print(system.time(replicate(10, chrs[cc])))
>
> user system elapsed
> 0.136 0.004 0.141
>
> It's fast.
>
> However, if I get the wrong names for the last two chromosomes (chr23 and
> chr24 instead of chrX and chrY)
>
> cc2 <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19", 14726),
> rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252),
> rep("chr23", 17498), rep("chr24", 1296))
> print(system.time(replicate(10, chrs[cc2])))
>
> user system elapsed
> 144.672 0.012 144.675
>
>
> It is MUCH slower (about 1000x).
>
>
> BUT, if I shuffle the elements in the second vector
>
> cc3 <- sample(cc2, length(cc), replace = FALSE)
> print(system.time(replicate(10, chrs[cc3])))
>
> user system elapsed
> 0.096 0.004 0.102
>
> It's fast again!!!
>
>
>
> The elapsed time is related to the number of elements BEFORE the failing
> names,
>
> cc4 <- c(rep("chr22", 10252), rep("chr23", 17498), rep("chr24", 1296))
> print(system.time(replicate(10, chrs[cc4])))
>
> user system elapsed
> 17.332 0.004 17.336
>
> cc5 <- c(rep("chr23", 17498), rep("chr24", 1296))
> print(system.time(replicate(10, chrs[cc5])))
>
> user system elapsed
> 1.872 0.000 1.901
>
>
> so my guess is that it might come from moving around the vector in memory
> for each "failed" selection or something similar...
>
> Is it correct? Is there anything I'm missing?
>
> Thanks a lot
>
> Bernat
>
> --
>
> *Bernat Gel Moreno*
> Bioinformatician
>
> Hereditary Cancer Program
> Program of Predictive and Personalized Medicine of Cancer (PMPPC)
> Germans Trias i Pujol Research Institute (IGTP)
>
> Campus Can Ruti
> Carretera de Can Ruti, Camí de les Escoles s/n
> 08916 Badalona, Barcelona, Spain
>
> Tel: (+34) 93 554 3068
> Fax: (+34) 93 497 8654
> bgel at igtp.cat <mailto:bgel at igtp.cat>
> www.germanstrias.org <http://www.germanstrias.org/>
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel