[Bioc-devel] Help understanding an R performance issue
Bernat Gel
bgel at igtp.cat
Fri Jun 30 08:58:16 CEST 2017
Yes, that would explain part of the situation. But example cc5 shows
that hash misses would account only for part of the time.
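The arithmetic behind that remark can be sketched from the timings quoted further down in this message. This is only a back-of-the-envelope check, not a profile of R's internals: the variable names (misses, hits, user, extra_per_hit) are made up for the illustration, and the numbers are copied from the cc5, cc4 and cc2 runs in the thread.

```r
# Figures copied from the thread: "user" seconds for 10 replicates.
# Number of non-matching entries (chr23 + chr24), identical in all three runs.
misses <- 17498 + 1296

# Number of matching entries that come BEFORE the non-matching ones.
hits <- c(cc5 = 0,
          cc4 = 10252,
          cc2 = 19891 + 21353 + 14726 + 18135 + 10068 + 10252)
user <- c(cc5 = 1.872, cc4 = 17.332, cc2 = 144.672)

# Time beyond the all-miss baseline (cc5), divided by the number of
# matching elements that precede the misses.
extra <- user[c("cc4", "cc2")] - user[["cc5"]]
extra_per_hit <- extra / hits[c("cc4", "cc2")]
print(signif(extra_per_hit, 3))
```

Both ratios come out at roughly 1.5 ms per preceding element, so with the number of misses held fixed the extra time grows about linearly with the number of matching elements in front of them. That is consistent with the misses alone (cc5, under 2 seconds) explaining only a small part of the total, although it does not by itself say where the rest is spent.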
Thanks for looking into it.
Bernat
*Bernat Gel Moreno*
Bioinformatician
Hereditary Cancer Program
Program of Predictive and Personalized Medicine of Cancer (PMPPC)
Germans Trias i Pujol Research Institute (IGTP)
Campus Can Ruti
Carretera de Can Ruti, Camí de les Escoles s/n
08916 Badalona, Barcelona, Spain
Tel: (+34) 93 554 3068
Fax: (+34) 93 497 8654
bgel at igtp.cat <mailto:bgel at igtp.cat>
www.germanstrias.org <http://www.germanstrias.org/>
On 06/29/2017 at 08:48 PM, Michael Lawrence wrote:
> Preliminary analysis suggests that this is due to hash misses. When
> that happens, R ends up doing costly string comparisons that are on
> the order of n^2 where 'n' is the length of the subscript. Looking
> into it.
>
> On Thu, Jun 29, 2017 at 10:43 AM, Bernat Gel <bgel at igtp.cat> wrote:
>> Hi all,
>>
>> This is not strictly a Bioconductor question, but I hope some of the experts
>> here can help me understand what's going on with a performance issue I've
>> found working on a package.
>>
>> It has to do with selecting elements from a named vector.
>>
>> Say we have a vector with the names of the chromosomes and their order:
>>
>> chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y")))
>> chrs
>>
>>  chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9 chr10 chr11 chr12
>>     1     2     3     4     5     6     7     8     9    10    11    12
>> chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22  chrX  chrY
>>    13    14    15    16    17    18    19    20    21    22    23    24
>>
>> And we have a second vector of chromosomes (in this case, the chromosomes
>> from SNP-array probes), which we want to use to select from the first one
>> by name:
>>
>> cc <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19", 14726),
>> rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252),
>> rep("chrX", 17498), rep("chrY", 1296))
>> print(system.time(replicate(10, chrs[cc])))
>>
>>    user  system elapsed
>>   0.136   0.004   0.141
>>
>> It's fast.
>>
>> However, if I use the wrong names for the last two chromosomes (chr23 and
>> chr24 instead of chrX and chrY):
>>
>> cc2 <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19", 14726),
>> rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252),
>> rep("chr23", 17498), rep("chr24", 1296))
>> print(system.time(replicate(10, chrs[cc2])))
>>
>>    user  system elapsed
>> 144.672   0.012 144.675
>>
>>
>> It is MUCH slower (about 1000x).
>>
>>
>> BUT, if I shuffle the elements in the second vector:
>>
>> cc3 <- sample(cc2, length(cc2), replace = FALSE)
>> print(system.time(replicate(10, chrs[cc3])))
>>
>>    user  system elapsed
>>   0.096   0.004   0.102
>>
>> It's fast again!!!
>>
>>
>>
>> The elapsed time depends on the number of elements BEFORE the failing
>> names:
>>
>> cc4 <- c(rep("chr22", 10252), rep("chr23", 17498), rep("chr24", 1296))
>> print(system.time(replicate(10, chrs[cc4])))
>>
>>    user  system elapsed
>>  17.332   0.004  17.336
>>
>> cc5 <- c(rep("chr23", 17498), rep("chr24", 1296))
>> print(system.time(replicate(10, chrs[cc5])))
>>
>>    user  system elapsed
>>   1.872   0.000   1.901
>>
>>
>> So my guess is that it might come from moving the vector around in memory
>> for each "failed" selection, or something similar...
>>
>> Is that correct? Is there anything I'm missing?
>>
>> Thanks a lot
>>
>> Bernat
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
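For anyone hitting the same slowdown, one possible workaround (not proposed by anyone in the thread, so treat it as an assumption rather than the recommended fix) is to resolve the names to integer positions once with match() and then subset by position. The sketch below uses a much smaller, made-up vector (cc_small) so that it runs quickly on any R version; it only checks that both approaches select the same values and makes no claim about timings.

```r
# Named vector from the thread: chromosome name -> position.
chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y")))

# A small subscript mixing one valid name with two names that are absent
# from `chrs` (sizes reduced for illustration; hypothetical data).
cc_small <- c(rep("chr22", 1000), rep("chr23", 500), rep("chr24", 50))

# Direct character subscripting, as in the thread.
by_name <- chrs[cc_small]

# Assumed workaround: look the names up once with match(), which returns
# NA for names that are not present, then subset by integer position.
idx <- match(cc_small, names(chrs))
by_pos <- chrs[idx]

# Both approaches agree on the selected values (NA for the misses).
# Names are dropped before comparing, since only the values matter here.
stopifnot(identical(unname(by_name), unname(by_pos)))
```

Whether this actually avoids the slow path depends on the R version, so it is worth timing both forms with system.time() on the real data before relying on it.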