[Bioc-devel] Help understanding an R performance issue

Fri Jun 30 15:44:18 CEST 2017

Ok, that makes sense

In my current use case I think I'll be able to filter out first the 
elements that will miss, so this behaviour is not triggered.

But it's good to know this happens so I can try to avoid it in the future.

Thanks.

Bernat

*Bernat Gel Moreno*
Bioinformatician

Hereditary Cancer Program
Program of Predictive and Personalized Medicine of Cancer (PMPPC)
Germans Trias i Pujol Research Institute (IGTP)

Campus Can Ruti
Carretera de Can Ruti, Camí de les Escoles s/n
08916 Badalona, Barcelona, Spain

Tel: (+34) 93 554 3068
Fax: (+34) 93 497 8654
08916 Badalona, Barcelona, Spain
bgel at igtp.cat <mailto:bgel at igtp.cat>
www.germanstrias.org <http://www.germanstrias.org/>

<http://www.germanstrias.org/>

El 06/30/2017 a las 03:20 PM, Michael Lawrence escribió:
> The reason it's faster when shuffled vs. all that end is that when a
> miss happens R compares the string to all strings before it in the
> subscript. So it's a lot worse to have a miss towards the end.
>
> As Martin wrote, there are basically two possible improvements that
> are somewhat complementary:
> 1) Tell stringSubscript() that it is not replacing so there is no need
> to do that scan. This would require passing an argument down the call
> stack.
> 2) Do a self match on the subscript like in Martin's patch, although
> it should probably be done lazily on the first miss.
>
> Michael
>
> On Fri, Jun 30, 2017 at 3:32 AM, Bernat Gel <bgel at igtp.cat> wrote:
>> Ok, so it seems more like a bug somewhere than something I falied to
>> understand, then.
>>
>> One of the surprises for me is that shuffling the data so the misses do not
>> happen one after the other seems to solve the issue...
>>
>> Thanks,
>>
>> Bernat
>>
>> *Bernat Gel Moreno*
>> Bioinformatician
>>
>> Hereditary Cancer Program
>> Program of Predictive and Personalized Medicine of Cancer (PMPPC)
>> Germans Trias i Pujol Research Institute (IGTP)
>>
>> Campus Can Ruti
>> Carretera de Can Ruti, Camí de les Escoles s/n
>> 08916 Badalona, Barcelona, Spain
>>
>> Tel: (+34) 93 554 3068
>> Fax: (+34) 93 497 8654
>> 08916 Badalona, Barcelona, Spain
>> bgel at igtp.cat <mailto:bgel at igtp.cat>
>> www.germanstrias.org <http://www.germanstrias.org/>
>>
>> <http://www.germanstrias.org/>
>>
>>
>>
>>
>>
>>
>>
>>
>> El 06/30/2017 a las 11:21 AM, Hervé Pagès escribió:
>>> Hi Bernat, Michael,
>>>
>>> FWIW I reported this issue on R-devel a couple of times. Last time was
>>> in 2013:
>>>
>>>    https://stat.ethz.ch/pipermail/r-devel/2013-May/066616.html
>>>
>>> Cheers,
>>> H.
>>>
>>> On 06/29/2017 11:58 PM, Bernat Gel wrote:
>>>> Yes, that would explain part of the situation. But example cc5 shows
>>>> that hash misses would account only for part of the time.
>>>>
>>>> Thanks for taking a look into it
>>>>
>>>> Bernat
>>>>
>>>> *Bernat Gel Moreno*
>>>> Bioinformatician
>>>>
>>>> Hereditary Cancer Program
>>>> Program of Predictive and Personalized Medicine of Cancer (PMPPC)
>>>> Germans Trias i Pujol Research Institute (IGTP)
>>>>
>>>> Campus Can Ruti
>>>> Carretera de Can Ruti, Camí de les Escoles s/n
>>>> 08916 Badalona, Barcelona, Spain
>>>>
>>>> Tel: (+34) 93 554 3068
>>>> Fax: (+34) 93 497 8654
>>>> 08916 Badalona, Barcelona, Spain
>>>> bgel at igtp.cat <mailto:bgel at igtp.cat>
>>>> www.germanstrias.org
>>>>
>>>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.germanstrias.org_&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=J5Gs0N5MH_g9sSCZ6jNoZm_Dkc0EcHLbOVPcNwdqZ_4&s=xNWXpfkTzxBoF_c0HoPoyQ0c3v6DA9_xY2WLtwleFlA&e=
>>>>   >
>>>>
>>>>
>>>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.germanstrias.org_&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=J5Gs0N5MH_g9sSCZ6jNoZm_Dkc0EcHLbOVPcNwdqZ_4&s=xNWXpfkTzxBoF_c0HoPoyQ0c3v6DA9_xY2WLtwleFlA&e=
>>>>   >
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> El 06/29/2017 a las 08:48 PM, Michael Lawrence escribió:
>>>>> Preliminary analysis suggests that this is due to hash misses. When
>>>>> that happens, R ends up doing costly string comparisons that are on
>>>>> the order of n^2 where 'n' is the length of the subscript. Looking
>>>>> into it.
>>>>>
>>>>> On Thu, Jun 29, 2017 at 10:43 AM, Bernat Gel <bgel at igtp.cat> wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> This is not strictly a Bioconductor question, but I hope some of the
>>>>>> experts
>>>>>> here can help me understand what's going on with a performance issue
>>>>>> I've
>>>>>> found working on a package.
>>>>>>
>>>>>> It has to do with selecting elements from a named vector.
>>>>>>
>>>>>> If we have a vector with the names of the chromosomes and their order
>>>>>>
>>>>>>       chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y")))
>>>>>>       chrs
>>>>>>
>>>>>> chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9 chr10 chr11
>>>>>> chr12 chr13
>>>>>> chr14 chr15 chr16 chr17
>>>>>>       1     2     3     4     5     6     7     8     9 10    11
>>>>>> 12    13
>>>>>> 14    15    16    17
>>>>>> chr18 chr19 chr20 chr21 chr22  chrX  chrY
>>>>>>      18    19    20    21    22    23    24
>>>>>>
>>>>>> And we have a second vector of chromosomes (in this case, the
>>>>>> chromosomes
>>>>>> from SNP-array probes)
>>>>>> And we want to use the second vector to select from the first one by
>>>>>> name
>>>>>>
>>>>>>       cc <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19",
>>>>>> 14726),
>>>>>>           rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252),
>>>>>>           rep("chrX", 17498), rep("chrY", 1296))
>>>>>>       print(system.time(replicate(10, chrs[cc])))
>>>>>>
>>>>>> user  system elapsed
>>>>>> 0.136   0.004   0.141
>>>>>>
>>>>>> It's fast.
>>>>>>
>>>>>> However, if I get the wrong names for the last two chromosomes (chr23
>>>>>> and
>>>>>> chr24 instead of chrX and chrY)
>>>>>>
>>>>>>        cc2 <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19",
>>>>>> 14726),
>>>>>>           rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252),
>>>>>>           rep("chr23", 17498), rep("chr24", 1296))
>>>>>>        print(system.time(replicate(10, chrs[cc2])))
>>>>>>
>>>>>> user  system elapsed
>>>>>> 144.672   0.012 144.675
>>>>>>
>>>>>>
>>>>>> It is MUCH slower. (1000x)
>>>>>>
>>>>>>
>>>>>> BUT, if I shuffle the elements in the second vector
>>>>>>
>>>>>>       cc3 <- sample(cc2, length(cc), replace = FALSE)
>>>>>>       print(system.time(replicate(10, chrs[cc3])))
>>>>>>
>>>>>> user  system elapsed
>>>>>> 0.096   0.004   0.102
>>>>>>
>>>>>> It's fast again!!!
>>>>>>
>>>>>>
>>>>>>
>>>>>> The elapsed time is related to the number of elements BEFORE the
>>>>>> failing
>>>>>> names,
>>>>>>
>>>>>>       cc4 <- c(rep("chr22", 10252), rep("chr23", 17498), rep("chr24",
>>>>>> 1296))
>>>>>>       print(system.time(replicate(10, chrs[cc4])))
>>>>>>
>>>>>> user  system elapsed
>>>>>> 17.332   0.004  17.336
>>>>>>
>>>>>>       cc5 <- c(rep("chr23", 17498), rep("chr24", 1296))
>>>>>>       print(system.time(replicate(10, chrs[cc5])))
>>>>>>
>>>>>> user  system elapsed
>>>>>> 1.872   0.000   1.901
>>>>>>
>>>>>>
>>>>>> so my guess is that it might come from moving around the vector in
>>>>>> memory
>>>>>> for each "failed" selection or something similar...
>>>>>>
>>>>>> Is it correct? Is there anything I'm missing?
>>>>>>
>>>>>> Thanks a lot
>>>>>>
>>>>>> Bernat
>>>>>>
>>>>>> --
>>>>>>
>>>>>> *Bernat Gel Moreno*
>>>>>> Bioinformatician
>>>>>>
>>>>>> Hereditary Cancer Program
>>>>>> Program of Predictive and Personalized Medicine of Cancer (PMPPC)
>>>>>> Germans Trias i Pujol Research Institute (IGTP)
>>>>>>
>>>>>> Campus Can Ruti
>>>>>> Carretera de Can Ruti, Camí de les Escoles s/n
>>>>>> 08916 Badalona, Barcelona, Spain
>>>>>>
>>>>>> Tel: (+34) 93 554 3068
>>>>>> Fax: (+34) 93 497 8654
>>>>>> 08916 Badalona, Barcelona, Spain
>>>>>> bgel at igtp.cat <mailto:bgel at igtp.cat>
>>>>>> www.germanstrias.org
>>>>>>
>>>>>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.germanstrias.org_&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=J5Gs0N5MH_g9sSCZ6jNoZm_Dkc0EcHLbOVPcNwdqZ_4&s=xNWXpfkTzxBoF_c0HoPoyQ0c3v6DA9_xY2WLtwleFlA&e=
>>>>>>
>>>>>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.germanstrias.org_&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=J5Gs0N5MH_g9sSCZ6jNoZm_Dkc0EcHLbOVPcNwdqZ_4&s=xNWXpfkTzxBoF_c0HoPoyQ0c3v6DA9_xY2WLtwleFlA&e=
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>
>>>>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=J5Gs0N5MH_g9sSCZ6jNoZm_Dkc0EcHLbOVPcNwdqZ_4&s=4AkjVXY9i8VhAZjQ5gpQD1gtNh2arVzMoNoadhtUUbY&e=
>>>>
>>>>
>>>> _______________________________________________
>>>> Bioc-devel at r-project.org mailing list
>>>>
>>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=J5Gs0N5MH_g9sSCZ6jNoZm_Dkc0EcHLbOVPcNwdqZ_4&s=4AkjVXY9i8VhAZjQ5gpQD1gtNh2arVzMoNoadhtUUbY&e=
>>>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel