[Bioc-devel] Help understanding an R performance issue

Hervé Pagès hpages at fredhutch.org
Fri Jun 30 11:21:39 CEST 2017


Hi Bernat, Michael,

FWIW I reported this issue on R-devel a couple of times. The last time
was in 2013:

   https://stat.ethz.ch/pipermail/r-devel/2013-May/066616.html
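
In the meantime, one possible workaround is to do the name lookup
explicitly with match() and then subset by integer position, so that
the character-subscript code path is not used at all. The following is
only a sketch built on Bernat's example (the names idx and res are
illustrative, and no timings are claimed for it):

```r
# Named vector of chromosome order, as in Bernat's example.
chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y")))

# Subscript whose last two names ("chr23", "chr24") are not in names(chrs).
cc2 <- c(rep("chr22", 10252), rep("chr23", 17498), rep("chr24", 1296))

# Resolve each name to its integer position once.
# match() returns the first matching position, or NA when there is none.
idx <- match(cc2, names(chrs))

# Subset by position; NA positions yield NA values.
res <- chrs[idx]

# Count how many lookups failed.
sum(is.na(res))
```

Names that are not found come back as NA, so it is still worth checking
sum(is.na(res)) before using the result.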

Cheers,
H.

On 06/29/2017 11:58 PM, Bernat Gel wrote:
> Yes, that would explain part of the situation. But example cc5 shows
> that hash misses would account for only part of the time.
>
> Thanks for taking a look into it
>
> Bernat
>
> *Bernat Gel Moreno*
> Bioinformatician
>
> Hereditary Cancer Program
> Program of Predictive and Personalized Medicine of Cancer (PMPPC)
> Germans Trias i Pujol Research Institute (IGTP)
>
> Campus Can Ruti
> Carretera de Can Ruti, Camí de les Escoles s/n
> 08916 Badalona, Barcelona, Spain
>
> Tel: (+34) 93 554 3068
> Fax: (+34) 93 497 8654
> bgel at igtp.cat
> www.germanstrias.org
>
> On 06/29/2017 at 08:48 PM, Michael Lawrence wrote:
>> Preliminary analysis suggests that this is due to hash misses. When
>> that happens, R ends up doing costly string comparisons that are on
>> the order of n^2 where 'n' is the length of the subscript. Looking
>> into it.
>>
>> On Thu, Jun 29, 2017 at 10:43 AM, Bernat Gel <bgel at igtp.cat> wrote:
>>> Hi all,
>>>
>>> This is not strictly a Bioconductor question, but I hope some of the
>>> experts here can help me understand what's going on with a performance
>>> issue I've found working on a package.
>>>
>>> It has to do with selecting elements from a named vector.
>>>
>>> If we have a vector with the names of the chromosomes and their order
>>>
>>>      chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y")))
>>>      chrs
>>>
>>>  chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9 chr10 chr11 chr12
>>>     1     2     3     4     5     6     7     8     9    10    11    12
>>> chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22  chrX  chrY
>>>    13    14    15    16    17    18    19    20    21    22    23    24
>>>
>>> And we have a second vector of chromosomes (in this case, the
>>> chromosomes from SNP-array probes), and we want to use the second
>>> vector to select from the first one by name
>>>
>>>      cc <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19", 14726),
>>>              rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252),
>>>              rep("chrX", 17498), rep("chrY", 1296))
>>>      print(system.time(replicate(10, chrs[cc])))
>>>
>>> user  system elapsed
>>> 0.136   0.004   0.141
>>>
>>> It's fast.
>>>
>>> However, if I get the wrong names for the last two chromosomes (chr23
>>> and chr24 instead of chrX and chrY)
>>>
>>>      cc2 <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19", 14726),
>>>               rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252),
>>>               rep("chr23", 17498), rep("chr24", 1296))
>>>      print(system.time(replicate(10, chrs[cc2])))
>>>
>>> user  system elapsed
>>> 144.672   0.012 144.675
>>>
>>>
>>> It is MUCH slower. (1000x)
>>>
>>>
>>> BUT, if I shuffle the elements in the second vector
>>>
>>>      cc3 <- sample(cc2, length(cc), replace = FALSE)
>>>      print(system.time(replicate(10, chrs[cc3])))
>>>
>>> user  system elapsed
>>> 0.096   0.004   0.102
>>>
>>> It's fast again!!!
>>>
>>>
>>>
>>> The elapsed time is related to the number of elements BEFORE the
>>> failing names:
>>>
>>>      cc4 <- c(rep("chr22", 10252), rep("chr23", 17498), rep("chr24", 1296))
>>>      print(system.time(replicate(10, chrs[cc4])))
>>>
>>> user  system elapsed
>>> 17.332   0.004  17.336
>>>
>>>      cc5 <- c(rep("chr23", 17498), rep("chr24", 1296))
>>>      print(system.time(replicate(10, chrs[cc5])))
>>>
>>> user  system elapsed
>>> 1.872   0.000   1.901
>>>
>>>
>>> so my guess is that it might come from moving around the vector in
>>> memory for each "failed" selection or something similar...
>>>
>>> Is it correct? Is there anything I'm missing?
>>>
>>> Thanks a lot
>>>
>>> Bernat
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


