[Rd] Suggestion: 20% speed up of which() with two-character mod
Henrik Bengtsson
hb at stat.berkeley.edu
Tue Aug 5 17:45:07 CEST 2008
Hi,
thanks for this. I'll use "unified" diff next time, i.e.
diff -u <current>.R <new>.R
/Henrik
On Tue, Aug 5, 2008 at 5:54 AM, Martin Maechler
<maechler at stat.math.ethz.ch> wrote:
>>>>>> "HenrikB" == Henrik Bengtsson <hb at stat.berkeley.edu>
>>>>>> on Mon, 4 Aug 2008 21:14:12 -0700 writes:
>
> HenrikB> Hi,
>
> HenrikB> I just want to do a follow up this very simple
> HenrikB> fix/correction/speedup/cleanup of the base::which() function. Here is
> HenrikB> a diff:
>
> HenrikB> diff src/library/base/R/which.R which.R
> HenrikB> 21c21
> HenrikB> < wh <- seq_along(x)[ll <- x & !is.na(x)]
> HenrikB> ---
> >> wh <- seq_along(x)[x & !is.na(x)]
> HenrikB> 25c25
> HenrikB> < names(wh) <- names(x)[ll]
> HenrikB> ---
> >> names(wh) <- names(x)[wh]
>
> HenrikB> FYI, the 'll' variable is not used elsewhere. I've been going through
> HenrikB> this modifications several times and I cannot see any side effects.
>
> HenrikB> Could someone of R core please commit this?
>
> I had added your proposition to my version of R-devel in order
> to commit it, and had wanted to do my own performance tests
> under different scenarios, but I had forgotten / postponed it.
> {I have more such things , notably the "help.request() from Kate
> Mullen -- with quite a few of my own changes, not quite
> finished ... that will have to wait for after useR!2008 ..}
>
> In fact, it seems is pretty obvious that the version with [wh]
> instead of [ll] should be faster in most cases, and never
> slower,
> and so I do commit it now.
>
> Thank you Henrik, for the reminder.
>
> Martin
>
> HenrikB> BTW, when one report diff:s, do you prefer to get it with or without
> HenrikB> context information, e.g. -C 3?
>
> {My exact preference would depend on the size / style of the
> patch itself. It does not really matter, and as a general rule,
> I'd personally prefer '-u' (unified diffs which include context)}
>
> HenrikB> /Henrik
>
> HenrikB> On Fri, Jul 11, 2008 at 8:57 AM, Charles C. Berry <cberry at tajo.ucsd.edu> wrote:
> >> On Thu, 10 Jul 2008, Henrik Bengtsson wrote:
> >>
> >>> Hi,
> >>>
> >>> by replacing 'll' with 'wh' in the source code for base::which() one
> >>> gets ~20% speed up for *named logical vectors*.
> >>
> >>
> >> The amount of speedup depends on how sparse the TRUE values are.
> >>
> >> When the proportion of TRUEs gets small the speedup is more than twofold on
> >> my macbook. For high proportions of TRUE, the speedup is more like the 20%
> >> you cite.
> >>
> >> HTH,
> >>
> >> Chuck
> >>
> >>>
> >>> CURRENT CODE:
> >>>
> >>> which <- function(x, arr.ind = FALSE)
> >>> {
> >>> if(!is.logical(x))
> >>> stop("argument to 'which' is not logical")
> >>> wh <- seq_along(x)[ll <- x & !is.na(x)]
> >>> m <- length(wh)
> >>> dl <- dim(x)
> >>> if (is.null(dl) || !arr.ind) {
> >>> names(wh) <- names(x)[ll]
> >>> }
> >>> ...
> >>> wh;
> >>> }
> >>>
> >>> SUGGESTED CODE: (Remove 'll' and use 'wh')
> >>>
> >>> which2 <- function(x, arr.ind = FALSE)
> >>> {
> >>> if(!is.logical(x))
> >>> stop("argument to 'which' is not logical")
> >>> wh <- seq_along(x)[x & !is.na(x)]
> >>> m <- length(wh)
> >>> dl <- dim(x)
> >>> if (is.null(dl) || !arr.ind) {
> >>> names(wh) <- names(x)[wh]
> >>> }
> >>> ...
> >>> wh;
> >>> }
> >>>
> >>> That's all.
> >>>
> >>> BENCHMARKING:
> >>>
> >>> # To measure both in same environment
> >>> which1 <- base::which;
> >>> environment(which1) <- globalenv(); # Needed?
> >>>
> >>> N <- 1e6;
> >>> set.seed(0xbeef);
> >>> x <- sample(c(TRUE, FALSE), size=N, replace=TRUE);
> >>> names(x) <- seq_along(x);
> >>> B <- 10;
> >>> t1 <- system.time({ for (bb in 1:B) idxs1 <- which1(x); });
> >>> t2 <- system.time({ for (bb in 1:B) idxs2 <- which2(x); });
> >>> stopifnot(identical(idxs1, idxs2));
> >>> print(t1/t2);
> >>> # Fair benchmarking
> >>> t2 <- system.time({ for (bb in 1:B) idxs2 <- which2(x); });
> >>> t1 <- system.time({ for (bb in 1:B) idxs1 <- which1(x); });
> >>> print(t1/t2);
> >>> ## user system elapsed
> >>> ## 1.283186 1.052632 1.250000
> >>>
> >>> You get similar results if you put for loop outside the system.time()
> >>> call (and sum up the timings).
> >>>
> >>> Cheers
> >>>
> >>> Henrik
> >>>
> >>> ______________________________________________
> >>> R-devel at r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-devel
> >>>
> >>
> >> Charles C. Berry (858) 534-2098
> >> Dept of Family/Preventive
> >> Medicine
> >> E mailto:cberry at tajo.ucsd.edu UC San Diego
> >> http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901
> >>
> >>
> >>
>
> HenrikB> ______________________________________________
> HenrikB> R-devel at r-project.org mailing list
> HenrikB> https://stat.ethz.ch/mailman/listinfo/r-devel
>
More information about the R-devel
mailing list