[Bioc-devel] Help understanding an R performance issue

Fri Jun 30 12:32:47 CEST 2017

Ok, so it seems more like a bug somewhere than something I falied to 
understand, then.

One of the surprises for me is that shuffling the data so the misses do 
not happen one after the other seems to solve the issue...

Thanks,

Bernat

*Bernat Gel Moreno*
Bioinformatician

Hereditary Cancer Program
Program of Predictive and Personalized Medicine of Cancer (PMPPC)
Germans Trias i Pujol Research Institute (IGTP)

Campus Can Ruti
Carretera de Can Ruti, Camí de les Escoles s/n
08916 Badalona, Barcelona, Spain

Tel: (+34) 93 554 3068
Fax: (+34) 93 497 8654
08916 Badalona, Barcelona, Spain
bgel at igtp.cat <mailto:bgel at igtp.cat>
www.germanstrias.org <http://www.germanstrias.org/>

<http://www.germanstrias.org/>

El 06/30/2017 a las 11:21 AM, Hervé Pagès escribió:
> Hi Bernat, Michael,
>
> FWIW I reported this issue on R-devel a couple of times. Last time was
> in 2013:
>
>   https://stat.ethz.ch/pipermail/r-devel/2013-May/066616.html
>
> Cheers,
> H.
>
> On 06/29/2017 11:58 PM, Bernat Gel wrote:
>> Yes, that would explain part of the situation. But example cc5 shows
>> that hash misses would account only for part of the time.
>>
>> Thanks for taking a look into it
>>
>> Bernat
>>
>> *Bernat Gel Moreno*
>> Bioinformatician
>>
>> Hereditary Cancer Program
>> Program of Predictive and Personalized Medicine of Cancer (PMPPC)
>> Germans Trias i Pujol Research Institute (IGTP)
>>
>> Campus Can Ruti
>> Carretera de Can Ruti, Camí de les Escoles s/n
>> 08916 Badalona, Barcelona, Spain
>>
>> Tel: (+34) 93 554 3068
>> Fax: (+34) 93 497 8654
>> 08916 Badalona, Barcelona, Spain
>> bgel at igtp.cat <mailto:bgel at igtp.cat>
>> www.germanstrias.org
>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.germanstrias.org_&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=J5Gs0N5MH_g9sSCZ6jNoZm_Dkc0EcHLbOVPcNwdqZ_4&s=xNWXpfkTzxBoF_c0HoPoyQ0c3v6DA9_xY2WLtwleFlA&e= 
>>
>>  >
>>
>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.germanstrias.org_&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=J5Gs0N5MH_g9sSCZ6jNoZm_Dkc0EcHLbOVPcNwdqZ_4&s=xNWXpfkTzxBoF_c0HoPoyQ0c3v6DA9_xY2WLtwleFlA&e= 
>>
>>  >
>>
>>
>>
>>
>>
>>
>>
>> El 06/29/2017 a las 08:48 PM, Michael Lawrence escribió:
>>> Preliminary analysis suggests that this is due to hash misses. When
>>> that happens, R ends up doing costly string comparisons that are on
>>> the order of n^2 where 'n' is the length of the subscript. Looking
>>> into it.
>>>
>>> On Thu, Jun 29, 2017 at 10:43 AM, Bernat Gel <bgel at igtp.cat> wrote:
>>>> Hi all,
>>>>
>>>> This is not strictly a Bioconductor question, but I hope some of the
>>>> experts
>>>> here can help me understand what's going on with a performance issue
>>>> I've
>>>> found working on a package.
>>>>
>>>> It has to do with selecting elements from a named vector.
>>>>
>>>> If we have a vector with the names of the chromosomes and their order
>>>>
>>>>      chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y")))
>>>>      chrs
>>>>
>>>> chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9 chr10 chr11
>>>> chr12 chr13
>>>> chr14 chr15 chr16 chr17
>>>>      1     2     3     4     5     6     7     8     9 10    11
>>>> 12    13
>>>> 14    15    16    17
>>>> chr18 chr19 chr20 chr21 chr22  chrX  chrY
>>>>     18    19    20    21    22    23    24
>>>>
>>>> And we have a second vector of chromosomes (in this case, the
>>>> chromosomes
>>>> from SNP-array probes)
>>>> And we want to use the second vector to select from the first one by
>>>> name
>>>>
>>>>      cc <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19",
>>>> 14726),
>>>>          rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 
>>>> 10252),
>>>>          rep("chrX", 17498), rep("chrY", 1296))
>>>>      print(system.time(replicate(10, chrs[cc])))
>>>>
>>>> user  system elapsed
>>>> 0.136   0.004   0.141
>>>>
>>>> It's fast.
>>>>
>>>> However, if I get the wrong names for the last two chromosomes (chr23
>>>> and
>>>> chr24 instead of chrX and chrY)
>>>>
>>>>       cc2 <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19",
>>>> 14726),
>>>>          rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 
>>>> 10252),
>>>>          rep("chr23", 17498), rep("chr24", 1296))
>>>>       print(system.time(replicate(10, chrs[cc2])))
>>>>
>>>> user  system elapsed
>>>> 144.672   0.012 144.675
>>>>
>>>>
>>>> It is MUCH slower. (1000x)
>>>>
>>>>
>>>> BUT, if I shuffle the elements in the second vector
>>>>
>>>>      cc3 <- sample(cc2, length(cc), replace = FALSE)
>>>>      print(system.time(replicate(10, chrs[cc3])))
>>>>
>>>> user  system elapsed
>>>> 0.096   0.004   0.102
>>>>
>>>> It's fast again!!!
>>>>
>>>>
>>>>
>>>> The elapsed time is related to the number of elements BEFORE the 
>>>> failing
>>>> names,
>>>>
>>>>      cc4 <- c(rep("chr22", 10252), rep("chr23", 17498), rep("chr24",
>>>> 1296))
>>>>      print(system.time(replicate(10, chrs[cc4])))
>>>>
>>>> user  system elapsed
>>>> 17.332   0.004  17.336
>>>>
>>>>      cc5 <- c(rep("chr23", 17498), rep("chr24", 1296))
>>>>      print(system.time(replicate(10, chrs[cc5])))
>>>>
>>>> user  system elapsed
>>>> 1.872   0.000   1.901
>>>>
>>>>
>>>> so my guess is that it might come from moving around the vector in
>>>> memory
>>>> for each "failed" selection or something similar...
>>>>
>>>> Is it correct? Is there anything I'm missing?
>>>>
>>>> Thanks a lot
>>>>
>>>> Bernat
>>>>
>>>> -- 
>>>>
>>>> *Bernat Gel Moreno*
>>>> Bioinformatician
>>>>
>>>> Hereditary Cancer Program
>>>> Program of Predictive and Personalized Medicine of Cancer (PMPPC)
>>>> Germans Trias i Pujol Research Institute (IGTP)
>>>>
>>>> Campus Can Ruti
>>>> Carretera de Can Ruti, Camí de les Escoles s/n
>>>> 08916 Badalona, Barcelona, Spain
>>>>
>>>> Tel: (+34) 93 554 3068
>>>> Fax: (+34) 93 497 8654
>>>> 08916 Badalona, Barcelona, Spain
>>>> bgel at igtp.cat <mailto:bgel at igtp.cat>
>>>> www.germanstrias.org
>>>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.germanstrias.org_&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=J5Gs0N5MH_g9sSCZ6jNoZm_Dkc0EcHLbOVPcNwdqZ_4&s=xNWXpfkTzxBoF_c0HoPoyQ0c3v6DA9_xY2WLtwleFlA&e= 
>>>>
>>>> >
>>>>
>>>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.germanstrias.org_&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=J5Gs0N5MH_g9sSCZ6jNoZm_Dkc0EcHLbOVPcNwdqZ_4&s=xNWXpfkTzxBoF_c0HoPoyQ0c3v6DA9_xY2WLtwleFlA&e= 
>>>>
>>>> >
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Bioc-devel at r-project.org mailing list
>>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=J5Gs0N5MH_g9sSCZ6jNoZm_Dkc0EcHLbOVPcNwdqZ_4&s=4AkjVXY9i8VhAZjQ5gpQD1gtNh2arVzMoNoadhtUUbY&e= 
>>>>
>>
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=J5Gs0N5MH_g9sSCZ6jNoZm_Dkc0EcHLbOVPcNwdqZ_4&s=4AkjVXY9i8VhAZjQ5gpQD1gtNh2arVzMoNoadhtUUbY&e= 
>>
>>
>