[Rd] sum() and mean() for (ALTREP) integer sequences
Viechtbauer, Wolfgang (SP)
wo||g@ng@v|echtb@uer @end|ng |rom m@@@tr|chtun|ver@|ty@n|
Thu Sep 2 12:55:03 CEST 2021
Hi all,
I am trying to understand the performance of functions applied to integer sequences. Consider the following:
### begin example ###
library(lobstr)
library(microbenchmark)
x <- sample(1e6)
obj_size(x)
# 4,000,048 B
y <- 1:1e6
obj_size(y)
# 680 B
# The tiny size shows that 'y' is stored as a compact ALTREP sequence, not a
# materialized vector. As expected, both give the same sum:
sum(x)
# [1] 500000500000
sum(y)
# [1] 500000500000
# For 'x', we have to go through the trouble of actually summing up 1e6 integers.
# For 'y', knowing its form, we really just need to do:
1e6*(1e6+1)/2
# [1] 500000500000
# which should be a whole lot faster. And indeed, it is:
microbenchmark(sum(x),sum(y))
# Unit: nanoseconds
#   expr    min       lq      mean   median       uq    max neval cld
# sum(x) 533452 595204.5 634266.90 613102.5 638271.5 978519   100   b
# sum(y)    183    245.5    446.09    338.5    447.0   3233   100   a
# Now what about mean()?
mean(x)
# [1] 500000.5
mean(y)
# [1] 500000.5
# which is the same as
(1e6+1)/2
# [1] 500000.5
# But this surprised me:
microbenchmark(mean(x),mean(y))
# Unit: microseconds
#    expr      min        lq     mean   median       uq      max neval cld
# mean(x)  935.389  943.4795 1021.423  954.689  985.122 2065.974   100   a
# mean(y) 3500.262 3581.9530 3814.664 3637.984 3734.598 5866.768   100   b
### end example ###
So why is mean() slower on an ALTREP sequence than on an ordinary integer vector, when sum() is so much faster?
More generally, when sum() is applied to an ALTREP integer sequence, does R actually use a closed form such as n*(n+1)/2 (or, generalized to a sequence a:b, (a+b)*(b-a+1)/2) to compute the sum? If so, why does mean() apparently not do the same?
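To make the closed forms concrete, here is a minimal sketch of the arithmetic for a sequence a:b. The helper names seq_sum() and seq_mean() are made up for illustration and are not part of base R, and this is not a claim about what R does internally. It assumes a <= b and a step of 1.

```r
# Hypothetical helpers, for illustration only (not base R functions).

# Closed-form sum of the integer sequence a:b, assuming a <= b.
seq_sum <- function(a, b) {
  a <- as.numeric(a)
  b <- as.numeric(b)            # work in doubles to avoid integer overflow
  (a + b) * (b - a + 1) / 2     # (first + last) * length / 2
}

# Closed-form mean of a:b: the midpoint of the two endpoints.
seq_mean <- function(a, b) {
  (as.numeric(a) + as.numeric(b)) / 2
}

# Compare against the element-by-element results.
stopifnot(seq_sum(1, 1e6)  == sum(1:1e6))
stopifnot(seq_mean(1, 1e6) == mean(1:1e6))
stopifnot(seq_sum(5, 12)   == sum(5:12))
stopifnot(seq_mean(-3, 10) == mean(-3:10))
```

Both helpers do a constant amount of work regardless of the length of the sequence, which is the behaviour the sum(y) timing above seems consistent with.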
Best,
Wolfgang