[Bioc-devel] Help understanding an R performance issue

Thu Jun 29 20:48:30 CEST 2017

Preliminary analysis suggests that this is due to hash misses. When
a name in the subscript misses, R ends up doing costly string
comparisons that are on the order of n^2, where 'n' is the length of
the subscript. Looking into it.
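
In the meantime, a possible workaround (a sketch only, not a confirmed
fix): resolve the names to positions yourself with match(), then
subscript by integer position, so the character-subscript code path is
never entered. The vector sizes below are reduced stand-ins for the ones
in the quoted message, and 'idx', 'res' and 'same_values' are names made
up for this illustration.

```r
# Reduced stand-in for the vectors in the quoted message.
chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y")))
cc2  <- c(rep("chr22", 1000), rep("chr23", 1000), rep("chr24", 100))

# Resolve every name to its position in one pass.
# Names that are not in names(chrs) ("chr23", "chr24") become NA.
idx <- match(cc2, names(chrs))

# Subscript by integer position; NA positions yield NA values.
res <- chrs[idx]

# Check that the values agree with character subscripting (names ignored).
same_values <- identical(unname(res), unname(chrs[cc2]))
```

The comparison deliberately ignores names and only checks the values.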

On Thu, Jun 29, 2017 at 10:43 AM, Bernat Gel <bgel at igtp.cat> wrote:
> Hi all,
>
> This is not strictly a Bioconductor question, but I hope some of the experts
> here can help me understand what's going on with a performance issue I've
> found working on a package.
>
> It has to do with selecting elements from a named vector.
>
> If we have a vector with the names of the chromosomes and their order
>
>     chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y")))
>     chrs
>
>  chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9 chr10 chr11 chr12 chr13
>     1     2     3     4     5     6     7     8     9    10    11    12    13
> chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22  chrX  chrY
>    14    15    16    17    18    19    20    21    22    23    24
>
> And we have a second vector of chromosomes (in this case, the chromosomes
> from SNP-array probes), and we want to use it to select from the first
> one by name
>
>     cc <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19", 14726),
>         rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252),
>         rep("chrX", 17498), rep("chrY", 1296))
>     print(system.time(replicate(10, chrs[cc])))
>
>    user  system elapsed
>   0.136   0.004   0.141
>
> It's fast.
>
> However, if I get the wrong names for the last two chromosomes (chr23 and
> chr24 instead of chrX and chrY)
>
>      cc2 <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19", 14726),
>         rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252),
>         rep("chr23", 17498), rep("chr24", 1296))
>      print(system.time(replicate(10, chrs[cc2])))
>
>    user  system elapsed
> 144.672   0.012 144.675
>
>
> It is MUCH slower. (1000x)
>
>
> BUT, if I shuffle the elements in the second vector
>
>     cc3 <- sample(cc2, length(cc), replace = FALSE)
>     print(system.time(replicate(10, chrs[cc3])))
>
>    user  system elapsed
>   0.096   0.004   0.102
>
> It's fast again!!!
>
>
>
> The elapsed time is related to the number of elements BEFORE the failing
> names:
>
>     cc4 <- c(rep("chr22", 10252), rep("chr23", 17498), rep("chr24", 1296))
>     print(system.time(replicate(10, chrs[cc4])))
>
>    user  system elapsed
>  17.332   0.004  17.336
>
>     cc5 <- c(rep("chr23", 17498), rep("chr24", 1296))
>     print(system.time(replicate(10, chrs[cc5])))
>
>    user  system elapsed
>   1.872   0.000   1.901
>
>
> So my guess is that it might come from moving the vector around in memory
> for each "failed" selection, or something similar...
>
> Is that correct? Is there anything I'm missing?
>
> Thanks a lot
>
> Bernat
>
> --
>
> *Bernat Gel Moreno*
> Bioinformatician
>
> Hereditary Cancer Program
> Program of Predictive and Personalized Medicine of Cancer (PMPPC)
> Germans Trias i Pujol Research Institute (IGTP)
>
> Campus Can Ruti
> Carretera de Can Ruti, Camí de les Escoles s/n
> 08916 Badalona, Barcelona, Spain
>
> Tel: (+34) 93 554 3068
> Fax: (+34) 93 497 8654
> bgel at igtp.cat
> http://www.germanstrias.org/
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel