[Rd] sum() and mean() for (ALTREP) integer sequences
Duncan Murdoch
murdoch.duncan at gmail.com
Thu Sep 2 13:46:18 CEST 2021
On 02/09/2021 6:55 a.m., Viechtbauer, Wolfgang (SP) wrote:
> Hi all,
>
> I am trying to understand the performance of functions applied to integer sequences. Consider the following:
>
> ### begin example ###
>
> library(lobstr)
> library(microbenchmark)
>
> x <- sample(1e6)
> obj_size(x)
> # 4,000,048 B
>
> y <- 1:1e6
> obj_size(y)
> # 680 B
>
> # So we can see that 'y' uses ALTREP. These are, as expected, the same:
>
> sum(x)
> # [1] 500000500000
> sum(y)
> # [1] 500000500000
>
> # For 'x', we have to go through the trouble of actually summing up 1e6 integers.
> # For 'y', knowing its form, we really just need to do:
>
> 1e6*(1e6+1)/2
> # [1] 500000500000
>
> # which should be a whole lot faster. And indeed, it is:
>
> microbenchmark(sum(x),sum(y))
>
> # Unit: nanoseconds
> #    expr    min       lq      mean   median       uq    max neval cld
> #  sum(x) 533452 595204.5 634266.90 613102.5 638271.5 978519   100   b
> #  sum(y)    183    245.5    446.09    338.5    447.0   3233   100  a
>
> # Now what about mean()?
>
> mean(x)
> # [1] 500000.5
> mean(y)
> # [1] 500000.5
>
> # which is the same as
>
> (1e6+1)/2
> # [1] 500000.5
>
> # But this surprised me:
>
> microbenchmark(mean(x),mean(y))
>
> # Unit: microseconds
> #     expr      min        lq     mean   median       uq      max neval cld
> #  mean(x)  935.389  943.4795 1021.423  954.689  985.122 2065.974   100  a
> #  mean(y) 3500.262 3581.9530 3814.664 3637.984 3734.598 5866.768   100   b
>
> ### end example ###
>
> So why is mean() on an ALTREP sequence slower when sum() is faster?
>
> And more generally, when using sum() on an ALTREP integer sequence, does R actually use something like n*(n+1)/2 (or generalized to sequences a:b -- (a+b)*(b-a+1)/2) for computing the sum? If so, why not (it seems) for mean()?
The mean.default function looks like this:
function (x, trim = 0, na.rm = FALSE, ...)
{
    if (!is.numeric(x) && !is.complex(x) && !is.logical(x)) {
        warning("argument is not numeric or logical: returning NA")
        return(NA_real_)
    }
    if (na.rm)
        x <- x[!is.na(x)]
    if (!is.numeric(trim) || length(trim) != 1L)
        stop("'trim' must be numeric of length one")
    n <- length(x)
    if (trim > 0 && n) {
        if (is.complex(x))
            stop("trimmed means are not defined for complex data")
        if (anyNA(x))
            return(NA_real_)
        if (trim >= 0.5)
            return(stats::median(x, na.rm = FALSE))
        lo <- floor(n * trim) + 1
        hi <- n + 1 - lo
        x <- sort.int(x, partial = unique(c(lo, hi)))[lo:hi]
    }
    .Internal(mean(x))
}
So it does fixups for trimming and NA removal, then calls an internal
function. The internal function is the first part of do_summary, here:
https://github.com/wch/r-source/blob/f9c955fc6699a1f0482e4281ba658215c0e0b949/src/main/summary.c#L541-L556
It uses a separate function for each type to compute the mean. The real_mean
function here:
https://github.com/wch/r-source/blob/f9c955fc6699a1f0482e4281ba658215c0e0b949/src/main/summary.c#L476-L515
makes a big effort to avoid overflows.
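Roughly, as I read it (this is an R sketch of the same idea, not the actual C
code, which accumulates in long double), real_mean takes a first-pass estimate
and then corrects it with the mean of the residuals, so precision lost in the
first pass is recovered. Something like:

two_pass_mean <- function(x) {
    m <- sum(x) / length(x)              # first-pass estimate
    if (is.finite(m))
        m <- m + sum(x - m) / length(x)  # correct with the mean of the residuals
    m
}
two_pass_mean(c(1, 2, 3, 4))  # 2.5, agrees with mean(c(1, 2, 3, 4))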
So I suspect the reason mean.default doesn't use sum(x)/length(x) at the
end is that on a long vector sum(x) could overflow when mean(x) shouldn't.
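A small illustration of that point with integer input (the kind of vector
being discussed here):

x1 <- c(.Machine$integer.max, 1L)
sum(x1)    # NA, with an integer overflow warning, since sum() keeps the integer type
mean(x1)   # 1073741824, because mean() always accumulates in floating point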
So why not take the ALTREP into account? I suspect it's just too much
trouble for a rare case.
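If someone did want to special-case it, the arithmetic itself is trivial; a
hypothetical helper (mean_seq is not part of R) for a compact sequence a:b
would just be

mean_seq <- function(a, b) (a + b) / 2   # midpoint of the sequence a:b
mean_seq(1, 1e6)   # 500000.5, matching mean(1:1e6)

The work would be in wiring that into the internal mean code for the ALTREP
compact-sequence class, which is the "too much trouble" part.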
Duncan Murdoch