[Bioc-devel] Help understanding an R performance issue
Bernat Gel
bgel at igtp.cat
Fri Jun 30 12:32:47 CEST 2017
Ok, so it seems more like a bug somewhere than something I failed to
understand, then.
One of the surprises for me is that shuffling the data so the misses do
not happen one after the other seems to solve the issue...
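In the meantime, one possible workaround is to resolve the names to integer positions first and subset by position, so that the character-subscript path is not used at all. The following is only a minimal sketch under that assumption: lookup_by_name() is a hypothetical helper (not from any package), the test vector is a shortened version of cc4 from my original post, and I have not timed it against the slow case.

```r
# Sketch only: look names up explicitly with match() instead of chrs[cc4].
chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y")))

# Same shape as cc4 in the original post: a run of valid names followed
# by long runs of names that are NOT present in names(chrs).
cc4 <- c(rep("chr22", 10252), rep("chr23", 17498), rep("chr24", 1296))

# lookup_by_name() is a hypothetical helper, not part of base R.
lookup_by_name <- function(x, keys) {
  idx <- match(keys, names(x))  # integer positions, NA where a key is missing
  x[idx]                        # integer subscripting; an NA index yields NA
}

res <- lookup_by_name(chrs, cc4)
table(is.na(res))
```

Missing names still come back as NA, as they do with chrs[cc4], so downstream code that checks for NA should not need to change.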
Thanks,
Bernat
*Bernat Gel Moreno*
Bioinformatician
Hereditary Cancer Program
Program of Predictive and Personalized Medicine of Cancer (PMPPC)
Germans Trias i Pujol Research Institute (IGTP)
Campus Can Ruti
Carretera de Can Ruti, Camí de les Escoles s/n
08916 Badalona, Barcelona, Spain
Tel: (+34) 93 554 3068
Fax: (+34) 93 497 8654
bgel at igtp.cat <mailto:bgel at igtp.cat>
www.germanstrias.org <http://www.germanstrias.org/>
On 06/30/2017 at 11:21 AM, Hervé Pagès wrote:
> Hi Bernat, Michael,
>
> FWIW I reported this issue on R-devel a couple of times. Last time was
> in 2013:
>
> https://stat.ethz.ch/pipermail/r-devel/2013-May/066616.html
>
> Cheers,
> H.
>
> On 06/29/2017 11:58 PM, Bernat Gel wrote:
>> Yes, that would explain part of the situation. But example cc5 shows
>> that hash misses would account only for part of the time.
>>
>> Thanks for taking a look into it
>>
>> Bernat
>>
>> On 06/29/2017 at 08:48 PM, Michael Lawrence wrote:
>>> Preliminary analysis suggests that this is due to hash misses. When
>>> that happens, R ends up doing costly string comparisons that are on
>>> the order of n^2 where 'n' is the length of the subscript. Looking
>>> into it.
>>>
>>> On Thu, Jun 29, 2017 at 10:43 AM, Bernat Gel <bgel at igtp.cat> wrote:
>>>> Hi all,
>>>>
>>>> This is not strictly a Bioconductor question, but I hope some of the
>>>> experts here can help me understand what's going on with a performance
>>>> issue I've found working on a package.
>>>>
>>>> It has to do with selecting elements from a named vector.
>>>>
>>>> If we have a vector with the names of the chromosomes and their order
>>>>
>>>> chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y")))
>>>> chrs
>>>>
>>>>  chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9 chr10 chr11 chr12
>>>>     1     2     3     4     5     6     7     8     9    10    11    12
>>>> chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22  chrX  chrY
>>>>    13    14    15    16    17    18    19    20    21    22    23    24
>>>>
>>>> And we have a second vector of chromosomes (in this case, the
>>>> chromosomes from SNP-array probes), and we want to use the second
>>>> vector to select from the first one by name
>>>>
>>>> cc <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19", 14726),
>>>>         rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252),
>>>>         rep("chrX", 17498), rep("chrY", 1296))
>>>> print(system.time(replicate(10, chrs[cc])))
>>>>
>>>> user system elapsed
>>>> 0.136 0.004 0.141
>>>>
>>>> It's fast.
>>>>
>>>> However, if I get the wrong names for the last two chromosomes (chr23
>>>> and chr24 instead of chrX and chrY)
>>>>
>>>> cc2 <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19", 14726),
>>>>          rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252),
>>>>          rep("chr23", 17498), rep("chr24", 1296))
>>>> print(system.time(replicate(10, chrs[cc2])))
>>>>
>>>> user system elapsed
>>>> 144.672 0.012 144.675
>>>>
>>>>
>>>> It is MUCH slower (about 1000x).
>>>>
>>>>
>>>> BUT, if I shuffle the elements in the second vector
>>>>
>>>> cc3 <- sample(cc2, length(cc), replace = FALSE)
>>>> print(system.time(replicate(10, chrs[cc3])))
>>>>
>>>> user system elapsed
>>>> 0.096 0.004 0.102
>>>>
>>>> It's fast again!!!
>>>>
>>>>
>>>>
>>>> The elapsed time is related to the number of elements BEFORE the
>>>> failing names,
>>>>
>>>> cc4 <- c(rep("chr22", 10252), rep("chr23", 17498), rep("chr24", 1296))
>>>> print(system.time(replicate(10, chrs[cc4])))
>>>>
>>>> user system elapsed
>>>> 17.332 0.004 17.336
>>>>
>>>> cc5 <- c(rep("chr23", 17498), rep("chr24", 1296))
>>>> print(system.time(replicate(10, chrs[cc5])))
>>>>
>>>> user system elapsed
>>>> 1.872 0.000 1.901
>>>>
>>>>
>>>> so my guess is that it might come from moving the vector around in
>>>> memory for each "failed" selection, or something similar...
>>>>
>>>> Is that correct? Is there anything I'm missing?
>>>>
>>>> Thanks a lot
>>>>
>>>> Bernat
>>>>
>>
>>
>