[Rd] sum() and mean() for (ALTREP) integer sequences

Fri Sep 3 14:54:50 CEST 2021

Wow, thanks for this extensive reply, Andre. As Duncan mentioned, it's probably not worth the bother for such a rare case. And my curiosity as to why the timing difference was happening has been more than satisfied.

Best,
Wolfgang

>-----Original Message-----
>From: GILLIBERT, Andre [mailto:Andre.Gillibert using chu-rouen.fr]
>Sent: Thursday, 02 September, 2021 21:54
>To: Viechtbauer, Wolfgang (SP); r-devel using r-project.org
>Subject: RE: sum() and mean() for (ALTREP) integer sequences
>
>Hello,
>
>Please, find a long response below.
>
>== difference between mean(x) and sum(x)/length(x) ==
>
>At the core, mean(x) and sum(x)/length(x) works very differently for real numbers.
>Mean is more accurate when applied to a vector with a small variance but a very
>high mean, especially on platforms without extended precision numbers (i.e. long
>double is 64 bits rather than 80 bits).
>
>On a Windows 10 computer (with 80 bits long double):
>k=1e6
>base=2^51
>realmean = mean(1:k)
>sqa=(base+1):(base+k) # with ALTREP
>sq=(base+1):(base+k)+0.0 # without ALTREP
>mean(sq) - base - realmean  # correctly returns zero
>sum(sq)/k - base - realmean # incorrectly returns -0.5
>sum(sqa)/k - base - realmean # correctly returns zero
>
>On a GNU/Linux x86_64 computer with extended precision floating point disabled,
>the difference is worse:
>mean(sq) - base - realmean  # correctly returns zero
>sum(sq)/k - base - realmean # incorrectly returns -1180
>sum(sqa)/k - base - realmean # correctly returns zero
>
>Therefore (without ALTREP) sum can be inaccurate, due to floating point rounding
>errors.
>
>The good accuracy of mean() is due to a two-passes algorithm of the real_mean()
>function in src/main/summary.c.
>
>The algorithm can be summarized by the following equivalent R code:
>badmean=function(v) {sum(v)/length(v)}
>goodmean=function(v) {center = badmean(v); center + badmean(v - center)}
>goodmean(sq) - base - realmean # correctly returns zero
>
>== should mean() call ALTINTEGER_SUM ? ==
>As you noticed in the examples above, sum() is much more accurate with ALTREPs
>than with the naive algorithm, because there are no cumulative rounding errors.
>
>Moreover, if we focus on INTSXP, the maximum possible value is lower : INT_MAX =
>2^31-1
>The sum of a large sequence (e.g. 1:(2^31-1)) can still overflow the exact integer
>range (0 to 2^53) of an FP64, and the division does not always round back to the
>correct value.
>
>bad = 1:189812531L
>mean(bad) - sum(bad)/length(bad) # returns -1.5e-08 on a platform with FP80
>mean(bad) == 94906266 # correct value (the actual result is an integer)
>
>The implementation of mean() on INTSXP do not use the two-passes trick, but relies
>on a LDOUBLE (typically FP80 on the x86 platform) that is large enough to avoid
>rounding bugs even with huge integer sequences.
>
>Unfortunately the ALTINTEGER_SUM interface returns at best a FP64, and so, would
>not return the FP64 closest to the actual mean for some sequences.
>
>Adding a MEAN function to the ALTREP interface could solve the problem.
>
>== can mean performance be improved easily ? ==
>
>The mean() implementation for integers, supports ALTREPs with poor iteration
>performances, using the slow INTEGER_ELT() macro.
>Moreover, it converts each integer to LDOUBLE, which is slow.
>
>It can be improved using ITERATE_BY_REGION0 and using a 64 bits integer (if
>available) that cannot overflow on small batches (size = 512).
>
># before patching (on Ubuntu x86_64 Silvermont Celeron J1900)
>x = 1:1e8
>y = 1:1e8+0L
>system.time(mean(x)) # user 1.33 second
>system.time(mean(y)) # user 0.32 second
>
># after patching (on Ubuntu x86_64 Silvermont Celeron J1900)
># after patching (on Ubuntu x86_64 Silvermont Celeron J1900)
>x = 1:1e8
>y = 1:1e8+0L
>system.time(mean(x)) # user 0.29 second # more than x4 faster
>system.time(mean(y)) # user 0.18 second # x 1.7 faster !
>
>(patch attached to this message)
>
>The patch is not optimal. It should ideally use isum(), and risum() but these
>functions are a mess, needing refactoring. For instance, they take a 'call'
>argument and may display a warning message related to the sum() function.
>
>--
>Sincerely
>Andre GILLIBERT
>________________________________________
>De : R-devel <r-devel-bounces using r-project.org> de la part de Viechtbauer, Wolfgang
>(SP) <wolfgang.viechtbauer using maastrichtuniversity.nl>
>Envoyé : jeudi 2 septembre 2021 12:55:03
>À : r-devel using r-project.org
>Objet : [Rd] sum() and mean() for (ALTREP) integer sequences
>
>Hi all,
>
>I am trying to understand the performance of functions applied to integer
>sequences. Consider the following:
>
>### begin example ###
>
>library(lobstr)
>library(microbenchmark)
>
>x <- sample(1e6)
>obj_size(x)
># 4,000,048 B
>
>y <- 1:1e6
>obj_size(y)
># 680 B
>
># So we can see that 'y' uses ALTREP. These are, as expected, the same:
>
>sum(x)
># [1] 500000500000
>sum(y)
># [1] 500000500000
>
># For 'x', we have to go through the trouble of actually summing up 1e6 integers.
># For 'y', knowing its form, we really just need to do:
>
>1e6*(1e6+1)/2
># [1] 500000500000
>
># which should be a whole lot faster. And indeed, it is:
>
>microbenchmark(sum(x),sum(y))
>
># Unit: nanoseconds
>#    expr    min       lq      mean   median       uq    max neval cld
>#  sum(x) 533452 595204.5 634266.90 613102.5 638271.5 978519   100   b
>#  sum(y)    183    245.5    446.09    338.5    447.0   3233   100  a
>
># Now what about mean()?
>
>mean(x)
># [1] 500000.5
>mean(y)
># [1] 500000.5
>
># which is the same as
>
>(1e6+1)/2
># [1] 500000.5
>
># But this surprised me:
>
>microbenchmark(mean(x),mean(y))
>
># Unit: microseconds
>#     expr      min        lq     mean   median       uq      max neval cld
>#  mean(x)  935.389  943.4795 1021.423  954.689  985.122 2065.974   100  a
>#  mean(y) 3500.262 3581.9530 3814.664 3637.984 3734.598 5866.768   100   b
>
>### end example ###
>
>So why is mean() on an ALTREP sequence slower when sum() is faster?
>
>And more generally, when using sum() on an ALTREP integer sequence, does R
>actually use something like n*(n+1)/2 (or generalized to sequences a:b --
>(a+b)*(b-a+1)/2) for computing the sum? If so, why not (it seems) for mean()?
>
>Best,
>Wolfgang