[Rd] sum(..., na.rm=FALSE): Summing over NA_real_ values much more expensive than non-NAs for na.rm=FALSE? Hmm...

Mon Jun 1 03:08:50 CEST 2015

This is a great example how you cannot figure it out after spending
two hours troubleshooting, but a few minutes after you post to
R-devel, it's just jumps to you (is there a word for this other than
"impatient"?);

Let me answer my own question.  The discrepancy between my sum2() code
and the internal code for base::sum() is that the latter uses LDOUBLE
= long double (on some system it's only double, cf.
https://github.com/wch/r-source/blob/trunk/src/nmath/nmath.h#L28-L33),
whereas my sum2() code uses double.  So using long double, I can
reproduce the penalty of having NA_real_ with na.rm=FALSE;

> sum3 <- inline::cfunction(sig=c(x="double", narm="logical"), body='
#define LDOUBLE long double
 double *x_ = REAL(x);
 int narm_ = asLogical(narm);
 int n = length(x);
 LDOUBLE sum = 0.0;
 for (R_xlen_t i = 0; i < n; i++) {
   if (!narm_ || !ISNAN(x_[i])) sum += x_[i];
 }
 return ScalarReal((double)sum);
')

> x <- rep(0, 1e8)
> stopifnot(typeof(x) == "double")
> system.time(sum3(x, narm=FALSE))
   user  system elapsed
   0.40    0.00    0.44
> y <- rep(NA_real_, 1e8)
> stopifnot(typeof(y) == "double")
> system.time(sum3(y, narm=FALSE))
   user  system elapsed
   9.80    0.00    9.84
> z <- x; z[length(z)/2] <- NA_real_
> stopifnot(typeof(z) == "double")
> system.time(sum3(z, narm=FALSE))
   user  system elapsed
   4.49    0.00    4.50

This might even be what the following comment refers to:

/* Required by C99 but might be slow */
#ifdef HAVE_LONG_DOUBLE
# define LDOUBLE long double
#else
# define LDOUBLE double
#endif

So now I should rephrase my question: Is there away to avoid this
penalty when using 'long double'?  Is this something the compiler can
be clever about, or is the only solution to not use 'long double'?

/Henrik

On Sun, May 31, 2015 at 5:02 PM, Henrik Bengtsson
<henrik.bengtsson at ucsf.edu> wrote:
> I'm observing that base::sum(x, na.rm=FALSE) for typeof(x) == "double"
> is much more time consuming when there are missing values versus when
> there are not.  I'm observing this on both Window and Linux, but it's
> quite surprising to me.  Currently, my main suspect is settings in on
> how R was built.  The second suspect is my brain.  I hope that someone
> can clarify the below results and confirm or not whether they see the
> same.  Note, this is for "doubles", so I'm not expecting
> early-stopping as for "integers" (where testing for NA is cheap).
>
> On R 3.2.0, on Windows (using the official CRAN builds), on Linux
> (local built), and on OS X (official AT&T builds), I get:
>
>> x <- rep(0, 1e8)
>> stopifnot(typeof(x) == "double")
>> system.time(sum(x, na.rm=FALSE))
>    user  system elapsed
>    0.19    0.01    0.20
>
>> y <- rep(NA_real_, 1e8)
>> stopifnot(typeof(y) == "double")
>> system.time(sum(y, na.rm=FALSE))
>    user  system elapsed
>    9.54    0.00    9.55
>
>> z <- x; z[length(z)/2] <- NA_real_
>> stopifnot(typeof(z) == "double")
>> system.time(sum(z, na.rm=FALSE))
>    user  system elapsed
>    4.49    0.00    4.51
>
> Following the source code, I'm pretty sure the code
> (https://github.com/wch/r-source/blob/trunk/src/main/summary.c#L112-L128)
> performing the calculation is:
>
> static Rboolean rsum(double *x, R_xlen_t n, double *value, Rboolean narm)
> {
>   LDOUBLE s = 0.0;
>   Rboolean updated = FALSE;
>   for (R_xlen_t i = 0; i < n; i++) {
>     if (!narm || !ISNAN(x[i])) {
>       if(!updated) updated = TRUE;
>         s += x[i];
>     }
>   }
>   if(s > DBL_MAX) *value = R_PosInf;
>   else if (s < -DBL_MAX) *value = R_NegInf;
>   else *value = (double) s;
>   return updated;
> }
>
> In other words, when na.rm=FALSE, that inner for loop:
>
>   for (R_xlen_t i = 0; i < n; i++) {
>     if (!narm || !ISNAN(x[i])) {
>       if(!updated) updated = TRUE;
>         s += x[i];
>     }
>   }
>
> should effectively become (because !ISNAN(x[i]) "does not make a difference"):
>
>   for (R_xlen_t i = 0; i < n; i++) {
>     if (!narm) {
>       if(!updated) updated = TRUE;
>         s += x[i];
>     }
>   }
>
> That is, sum(x, na.rm=FALSE) basically spends time on `s += x[i]`.
> Now, I have always been under impression that summing with NA:s is
> *not* more expensive that summing over regular (double) values, which
> is confirmed by the below example, but the above benchmarking
> disagree.  It looks like there is a big overhead keeping track of the
> sum `s` being NA, which is supported by the fact that summing over 'z'
> is costs half of 'y'.
>
> Now, I *cannot* reproduce the above using the following 'inline' example:
>
>> sum2 <- inline::cfunction(sig=c(x="double", narm="logical"), body='
>  double *x_ = REAL(x);
>  int narm_ = asLogical(narm);
>  int n = length(x);
>  double sum = 0;
>  for (R_xlen_t i = 0; i < n; i++) {
>    if (!narm_ || !ISNAN(x_[i])) sum += x_[i];
>  }
>  return ScalarReal(sum);
> ')
>
>> x <- rep(0, 1e8)
>> stopifnot(typeof(x) == "double")
>> system.time(sum2(x, narm=FALSE))
>    user  system elapsed
>    0.16    0.00    0.16
>
>> y <- rep(NA_real_, 1e8)
>> stopifnot(typeof(y) == "double")
>> system.time(sum2(y, narm=FALSE))
>    user  system elapsed
>    0.16    0.00    0.15
>
>> z <- x; z[length(z)/2] <- NA_real_
>> stopifnot(typeof(z) == "double")
>> system.time(sum2(z, narm=FALSE))
>    user  system elapsed
>    0.16    0.00    0.15
>
> This is why I suspect it's related to how R was configured when it was
> built. What's going on? Can someone please bring some light on this?
>
> Thanks
>
> Henrik