[Bioc-devel] Help understanding an R performance issue

Michael Lawrence lawrence.michael at gene.com
Fri Jun 30 15:20:25 CEST 2017


The reason it's faster when shuffled vs. all that end is that when a
miss happens R compares the string to all strings before it in the
subscript. So it's a lot worse to have a miss towards the end.

As Martin wrote, there are basically two possible improvements that
are somewhat complementary:
1) Tell stringSubscript() that it is not replacing so there is no need
to do that scan. This would require passing an argument down the call
stack.
2) Do a self match on the subscript like in Martin's patch, although
it should probably be done lazily on the first miss.

Michael

On Fri, Jun 30, 2017 at 3:32 AM, Bernat Gel <bgel at igtp.cat> wrote:
> Ok, so it seems more like a bug somewhere than something I falied to
> understand, then.
>
> One of the surprises for me is that shuffling the data so the misses do not
> happen one after the other seems to solve the issue...
>
> Thanks,
>
> Bernat
>
> *Bernat Gel Moreno*
> Bioinformatician
>
> Hereditary Cancer Program
> Program of Predictive and Personalized Medicine of Cancer (PMPPC)
> Germans Trias i Pujol Research Institute (IGTP)
>
> Campus Can Ruti
> Carretera de Can Ruti, Camí de les Escoles s/n
> 08916 Badalona, Barcelona, Spain
>
> Tel: (+34) 93 554 3068
> Fax: (+34) 93 497 8654
> 08916 Badalona, Barcelona, Spain
> bgel at igtp.cat <mailto:bgel at igtp.cat>
> www.germanstrias.org <http://www.germanstrias.org/>
>
> <http://www.germanstrias.org/>
>
>
>
>
>
>
>
>
> El 06/30/2017 a las 11:21 AM, Hervé Pagès escribió:
>>
>> Hi Bernat, Michael,
>>
>> FWIW I reported this issue on R-devel a couple of times. Last time was
>> in 2013:
>>
>>   https://stat.ethz.ch/pipermail/r-devel/2013-May/066616.html
>>
>> Cheers,
>> H.
>>
>> On 06/29/2017 11:58 PM, Bernat Gel wrote:
>>>
>>> Yes, that would explain part of the situation. But example cc5 shows
>>> that hash misses would account only for part of the time.
>>>
>>> Thanks for taking a look into it
>>>
>>> Bernat
>>>
>>> *Bernat Gel Moreno*
>>> Bioinformatician
>>>
>>> Hereditary Cancer Program
>>> Program of Predictive and Personalized Medicine of Cancer (PMPPC)
>>> Germans Trias i Pujol Research Institute (IGTP)
>>>
>>> Campus Can Ruti
>>> Carretera de Can Ruti, Camí de les Escoles s/n
>>> 08916 Badalona, Barcelona, Spain
>>>
>>> Tel: (+34) 93 554 3068
>>> Fax: (+34) 93 497 8654
>>> 08916 Badalona, Barcelona, Spain
>>> bgel at igtp.cat <mailto:bgel at igtp.cat>
>>> www.germanstrias.org
>>>
>>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.germanstrias.org_&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=J5Gs0N5MH_g9sSCZ6jNoZm_Dkc0EcHLbOVPcNwdqZ_4&s=xNWXpfkTzxBoF_c0HoPoyQ0c3v6DA9_xY2WLtwleFlA&e=
>>>  >
>>>
>>>
>>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.germanstrias.org_&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=J5Gs0N5MH_g9sSCZ6jNoZm_Dkc0EcHLbOVPcNwdqZ_4&s=xNWXpfkTzxBoF_c0HoPoyQ0c3v6DA9_xY2WLtwleFlA&e=
>>>  >
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> El 06/29/2017 a las 08:48 PM, Michael Lawrence escribió:
>>>>
>>>> Preliminary analysis suggests that this is due to hash misses. When
>>>> that happens, R ends up doing costly string comparisons that are on
>>>> the order of n^2 where 'n' is the length of the subscript. Looking
>>>> into it.
>>>>
>>>> On Thu, Jun 29, 2017 at 10:43 AM, Bernat Gel <bgel at igtp.cat> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> This is not strictly a Bioconductor question, but I hope some of the
>>>>> experts
>>>>> here can help me understand what's going on with a performance issue
>>>>> I've
>>>>> found working on a package.
>>>>>
>>>>> It has to do with selecting elements from a named vector.
>>>>>
>>>>> If we have a vector with the names of the chromosomes and their order
>>>>>
>>>>>      chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y")))
>>>>>      chrs
>>>>>
>>>>> chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9 chr10 chr11
>>>>> chr12 chr13
>>>>> chr14 chr15 chr16 chr17
>>>>>      1     2     3     4     5     6     7     8     9 10    11
>>>>> 12    13
>>>>> 14    15    16    17
>>>>> chr18 chr19 chr20 chr21 chr22  chrX  chrY
>>>>>     18    19    20    21    22    23    24
>>>>>
>>>>> And we have a second vector of chromosomes (in this case, the
>>>>> chromosomes
>>>>> from SNP-array probes)
>>>>> And we want to use the second vector to select from the first one by
>>>>> name
>>>>>
>>>>>      cc <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19",
>>>>> 14726),
>>>>>          rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252),
>>>>>          rep("chrX", 17498), rep("chrY", 1296))
>>>>>      print(system.time(replicate(10, chrs[cc])))
>>>>>
>>>>> user  system elapsed
>>>>> 0.136   0.004   0.141
>>>>>
>>>>> It's fast.
>>>>>
>>>>> However, if I get the wrong names for the last two chromosomes (chr23
>>>>> and
>>>>> chr24 instead of chrX and chrY)
>>>>>
>>>>>       cc2 <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19",
>>>>> 14726),
>>>>>          rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252),
>>>>>          rep("chr23", 17498), rep("chr24", 1296))
>>>>>       print(system.time(replicate(10, chrs[cc2])))
>>>>>
>>>>> user  system elapsed
>>>>> 144.672   0.012 144.675
>>>>>
>>>>>
>>>>> It is MUCH slower. (1000x)
>>>>>
>>>>>
>>>>> BUT, if I shuffle the elements in the second vector
>>>>>
>>>>>      cc3 <- sample(cc2, length(cc), replace = FALSE)
>>>>>      print(system.time(replicate(10, chrs[cc3])))
>>>>>
>>>>> user  system elapsed
>>>>> 0.096   0.004   0.102
>>>>>
>>>>> It's fast again!!!
>>>>>
>>>>>
>>>>>
>>>>> The elapsed time is related to the number of elements BEFORE the
>>>>> failing
>>>>> names,
>>>>>
>>>>>      cc4 <- c(rep("chr22", 10252), rep("chr23", 17498), rep("chr24",
>>>>> 1296))
>>>>>      print(system.time(replicate(10, chrs[cc4])))
>>>>>
>>>>> user  system elapsed
>>>>> 17.332   0.004  17.336
>>>>>
>>>>>      cc5 <- c(rep("chr23", 17498), rep("chr24", 1296))
>>>>>      print(system.time(replicate(10, chrs[cc5])))
>>>>>
>>>>> user  system elapsed
>>>>> 1.872   0.000   1.901
>>>>>
>>>>>
>>>>> so my guess is that it might come from moving around the vector in
>>>>> memory
>>>>> for each "failed" selection or something similar...
>>>>>
>>>>> Is it correct? Is there anything I'm missing?
>>>>>
>>>>> Thanks a lot
>>>>>
>>>>> Bernat
>>>>>
>>>>> --
>>>>>
>>>>> *Bernat Gel Moreno*
>>>>> Bioinformatician
>>>>>
>>>>> Hereditary Cancer Program
>>>>> Program of Predictive and Personalized Medicine of Cancer (PMPPC)
>>>>> Germans Trias i Pujol Research Institute (IGTP)
>>>>>
>>>>> Campus Can Ruti
>>>>> Carretera de Can Ruti, Camí de les Escoles s/n
>>>>> 08916 Badalona, Barcelona, Spain
>>>>>
>>>>> Tel: (+34) 93 554 3068
>>>>> Fax: (+34) 93 497 8654
>>>>> 08916 Badalona, Barcelona, Spain
>>>>> bgel at igtp.cat <mailto:bgel at igtp.cat>
>>>>> www.germanstrias.org
>>>>>
>>>>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.germanstrias.org_&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=J5Gs0N5MH_g9sSCZ6jNoZm_Dkc0EcHLbOVPcNwdqZ_4&s=xNWXpfkTzxBoF_c0HoPoyQ0c3v6DA9_xY2WLtwleFlA&e=
>>>>> >
>>>>>
>>>>>
>>>>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.germanstrias.org_&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=J5Gs0N5MH_g9sSCZ6jNoZm_Dkc0EcHLbOVPcNwdqZ_4&s=xNWXpfkTzxBoF_c0HoPoyQ0c3v6DA9_xY2WLtwleFlA&e=
>>>>> >
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-devel at r-project.org mailing list
>>>>>
>>>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=J5Gs0N5MH_g9sSCZ6jNoZm_Dkc0EcHLbOVPcNwdqZ_4&s=4AkjVXY9i8VhAZjQ5gpQD1gtNh2arVzMoNoadhtUUbY&e=
>>>
>>>
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>>
>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=J5Gs0N5MH_g9sSCZ6jNoZm_Dkc0EcHLbOVPcNwdqZ_4&s=4AkjVXY9i8VhAZjQ5gpQD1gtNh2arVzMoNoadhtUUbY&e=
>>>
>>
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel



More information about the Bioc-devel mailing list