[Bioc-devel] Help understanding an R performance issue

Fri Jun 30 08:58:16 CEST 2017

Yes, that would explain part of the situation. But example cc5, which
contains only unmatched names, shows that hash misses alone would account
for only part of the time.
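
In the meantime, a possible workaround is to resolve the names to integer
positions with match() and subset by position instead of by name. This is
only a minimal sketch, assuming the slowdown is specific to character
subscripts: select_by_name is a hypothetical helper (not part of base R or
any package), and cc2_small is a scaled-down stand-in for cc2 so the
comparison runs quickly.

```r
# Named vector of chromosome order, as in the original message.
chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y")))

# Scaled-down version of cc2 (assumption: smaller counts than the original),
# with the unmatched names "chr23" and "chr24" placed at the end.
cc2_small <- c(rep("chr22", 1000), rep("chr23", 1000), rep("chr24", 100))

# Hypothetical helper: look the names up once with match(), then subset
# by integer position. Unmatched names become NA positions, which yield NA.
select_by_name <- function(x, nms) {
  x[match(nms, names(x))]
}

res_direct <- chrs[cc2_small]                    # character subscript
res_match  <- select_by_name(chrs, cc2_small)    # integer subscript

# Both approaches should return the same values, NAs included.
stopifnot(identical(unname(res_match), unname(res_direct)))
```

Wrapping both calls in system.time() with the full-size cc2 should show
whether the integer-subscript route avoids the slow path on a given R version.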

Thanks for looking into it.

Bernat

*Bernat Gel Moreno*
Bioinformatician

Hereditary Cancer Program
Program of Predictive and Personalized Medicine of Cancer (PMPPC)
Germans Trias i Pujol Research Institute (IGTP)

Campus Can Ruti
Carretera de Can Ruti, Camí de les Escoles s/n
08916 Badalona, Barcelona, Spain

Tel: (+34) 93 554 3068
Fax: (+34) 93 497 8654
bgel at igtp.cat <mailto:bgel at igtp.cat>
www.germanstrias.org <http://www.germanstrias.org/>

On 06/29/2017 at 08:48 PM, Michael Lawrence wrote:
> Preliminary analysis suggests that this is due to hash misses. When
> that happens, R ends up doing costly string comparisons that are on
> the order of n^2 where 'n' is the length of the subscript. Looking
> into it.
>
> On Thu, Jun 29, 2017 at 10:43 AM, Bernat Gel <bgel at igtp.cat> wrote:
>> Hi all,
>>
>> This is not strictly a Bioconductor question, but I hope some of the experts
>> here can help me understand what's going on with a performance issue I've
>> found working on a package.
>>
>> It has to do with selecting elements from a named vector.
>>
>> If we have a vector with the names of the chromosomes and their order:
>>
>>      chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y")))
>>      chrs
>>
>>  chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9 chr10 chr11 chr12 chr13
>>     1     2     3     4     5     6     7     8     9    10    11    12    13
>> chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22  chrX  chrY
>>    14    15    16    17    18    19    20    21    22    23    24
>>
>> And we have a second vector of chromosomes (in this case, the chromosomes
>> from SNP-array probes), and we want to use the second vector to select
>> from the first one by name:
>>
>>      cc <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19", 14726),
>>          rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252),
>>          rep("chrX", 17498), rep("chrY", 1296))
>>      print(system.time(replicate(10, chrs[cc])))
>>
>> user  system elapsed
>> 0.136   0.004   0.141
>>
>> It's fast.
>>
>> However, if I get the wrong names for the last two chromosomes (chr23 and
>> chr24 instead of chrX and chrY):
>>
>>       cc2 <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19", 14726),
>>          rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252),
>>          rep("chr23", 17498), rep("chr24", 1296))
>>       print(system.time(replicate(10, chrs[cc2])))
>>
>> user  system elapsed
>> 144.672   0.012 144.675
>>
>>
>> It is MUCH slower. (1000x)
>>
>>
>> BUT, if I shuffle the elements in the second vector:
>>
>>      cc3 <- sample(cc2, length(cc2), replace = FALSE)
>>      print(system.time(replicate(10, chrs[cc3])))
>>
>> user  system elapsed
>> 0.096   0.004   0.102
>>
>> It's fast again!!!
>>
>>
>>
>> The elapsed time is related to the number of elements BEFORE the failing
>> names:
>>
>>      cc4 <- c(rep("chr22", 10252), rep("chr23", 17498), rep("chr24", 1296))
>>      print(system.time(replicate(10, chrs[cc4])))
>>
>> user  system elapsed
>> 17.332   0.004  17.336
>>
>>      cc5 <- c(rep("chr23", 17498), rep("chr24", 1296))
>>      print(system.time(replicate(10, chrs[cc5])))
>>
>> user  system elapsed
>> 1.872   0.000   1.901
>>
>>
>> So my guess is that it might come from moving the vector around in memory
>> for each "failed" selection, or something similar.
>>
>> Is that correct? Is there anything I'm missing?
>>
>> Thanks a lot
>>
>> Bernat
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel