[Rd] sum(..., na.rm=FALSE): Summing over NA_real_ values much more expensive than non-NAs for na.rm=FALSE? Hmm...

Henrik Bengtsson henrik.bengtsson at ucsf.edu
Mon Jun 1 02:02:22 CEST 2015


I'm observing that base::sum(x, na.rm=FALSE) for typeof(x) == "double"
is much more time consuming when there are missing values than when
there are not.  I'm observing this on both Windows and Linux, and it
is quite surprising to me.  Currently, my main suspect is the settings
used when R was built.  The second suspect is my brain.  I hope that
someone can clarify the results below and confirm whether or not they
see the same.  Note that this is for "doubles", so I'm not expecting
early stopping as for "integers" (where testing for NA is cheap).
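
By early stopping for "integers" I mean, roughly, the kind of loop
sketched below.  This is only a sketch of the idea, not the actual
isum() code in summary.c: an integer NA is a single sentinel value
(NA_INTEGER), so the test is a cheap comparison and, with na.rm=FALSE,
the loop can bail out at the first NA.

#include <Rinternals.h>     /* for R_xlen_t and NA_INTEGER */

/* Sketch only, not the actual isum() from summary.c. */
static int na_seen_int(const int *x, R_xlen_t n)
{
  for (R_xlen_t i = 0; i < n; i++)
    if (x[i] == NA_INTEGER)   /* cheap sentinel comparison */
      return 1;               /* early exit: the sum is known to be NA */
  return 0;
}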

On R 3.2.0, on Windows (using the official CRAN builds), on Linux
(built locally), and on OS X (official AT&T builds), I get:

> x <- rep(0, 1e8)
> stopifnot(typeof(x) == "double")
> system.time(sum(x, na.rm=FALSE))
   user  system elapsed
   0.19    0.01    0.20

> y <- rep(NA_real_, 1e8)
> stopifnot(typeof(y) == "double")
> system.time(sum(y, na.rm=FALSE))
   user  system elapsed
   9.54    0.00    9.55

> z <- x; z[length(z)/2] <- NA_real_
> stopifnot(typeof(z) == "double")
> system.time(sum(z, na.rm=FALSE))
   user  system elapsed
   4.49    0.00    4.51

Following the source code, I'm pretty sure the code
(https://github.com/wch/r-source/blob/trunk/src/main/summary.c#L112-L128)
performing the calculation is:

static Rboolean rsum(double *x, R_xlen_t n, double *value, Rboolean narm)
{
  LDOUBLE s = 0.0;
  Rboolean updated = FALSE;
  for (R_xlen_t i = 0; i < n; i++) {
    if (!narm || !ISNAN(x[i])) {
      if (!updated) updated = TRUE;
      s += x[i];
    }
  }
  if (s > DBL_MAX) *value = R_PosInf;
  else if (s < -DBL_MAX) *value = R_NegInf;
  else *value = (double) s;
  return updated;
}

In other words, when na.rm=FALSE, that inner for loop:

  for (R_xlen_t i = 0; i < n; i++) {
    if (!narm || !ISNAN(x[i])) {
      if (!updated) updated = TRUE;
      s += x[i];
    }
  }

should effectively become the following, because with narm == FALSE the
!narm test is always TRUE, the || short-circuits, and !ISNAN(x[i]) is
never even evaluated:

  for (R_xlen_t i = 0; i < n; i++) {
    if (!narm) {
      if (!updated) updated = TRUE;
      s += x[i];
    }
  }

That is, sum(x, na.rm=FALSE) basically spends its time on `s += x[i]`.
Now, I have always been under the impression that summing over NAs is
*not* more expensive than summing over regular (double) values, which
is confirmed by the 'inline' example below, but the above benchmarking
disagrees.  It looks like there is a big overhead once the running sum
`s` has become NA, which is supported by the fact that summing over 'z'
costs about half of what summing over 'y' costs.
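
To check that in isolation, here is a minimal standalone C sketch
(outside R entirely) that times the same additions into a finite
versus a NaN-valued accumulator.  I'm using a long double accumulator
on the assumption that this is what LDOUBLE in rsum() maps to on a
typical build:

/* Sketch: time `s += x[i]` with a finite vs. a NaN starting value,
   using a long double accumulator as rsum() does via LDOUBLE. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static volatile double sink;    /* keep the result alive */

static double time_sum(long double s0, const double *x, size_t n)
{
  long double s = s0;
  clock_t t0 = clock();
  for (size_t i = 0; i < n; i++)
    s += x[i];                  /* same inner statement as in rsum() */
  clock_t t1 = clock();
  sink = (double) s;
  return (double)(t1 - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
  size_t n = 100000000;               /* 1e8 doubles, as in the R examples */
  double *x = calloc(n, sizeof *x);   /* all zeroes */
  if (!x) return 1;
  printf("accumulator starts at 0.0: %.2f s\n", time_sum(0.0L, x, n));
  printf("accumulator starts at NaN: %.2f s\n", time_sum((long double) NAN, x, n));
  free(x);
  return 0;
}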

Now, I *cannot* reproduce the above using the following 'inline' example:

> sum2 <- inline::cfunction(sig=c(x="double", narm="logical"), body='
 double *x_ = REAL(x);
 int narm_ = asLogical(narm);
 int n = length(x);
 double sum = 0;
 for (R_xlen_t i = 0; i < n; i++) {
   if (!narm_ || !ISNAN(x_[i])) sum += x_[i];
 }
 return ScalarReal(sum);
')

> x <- rep(0, 1e8)
> stopifnot(typeof(x) == "double")
> system.time(sum2(x, narm=FALSE))
   user  system elapsed
   0.16    0.00    0.16

> y <- rep(NA_real_, 1e8)
> stopifnot(typeof(y) == "double")
> system.time(sum2(y, narm=FALSE))
   user  system elapsed
   0.16    0.00    0.15

> z <- x; z[length(z)/2] <- NA_real_
> stopifnot(typeof(z) == "double")
> system.time(sum2(z, narm=FALSE))
   user  system elapsed
   0.16    0.00    0.15

This is why I suspect it's related to how R was configured when it was
built.  What's going on?  Can someone please shed some light on this?
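
For completeness, one detail that my sum2() above does not mirror is
rsum()'s accumulator type: rsum() sums into LDOUBLE, which as far as I
understand maps to long double on most builds (and can fall back to
plain double depending on how R was configured), whereas sum2() sums
into a plain double.  In case someone wants to test that particular
detail, a closer mirror would be something like the following untested
sketch (sum3 is just a name I made up here):

> sum3 <- inline::cfunction(sig=c(x="double", narm="logical"), body='
 double *x_ = REAL(x);
 int narm_ = asLogical(narm);
 R_xlen_t n = XLENGTH(x);
 long double sum = 0;  /* long double accumulator, as rsum() uses via LDOUBLE */
 for (R_xlen_t i = 0; i < n; i++) {
   if (!narm_ || !ISNAN(x_[i])) sum += x_[i];
 }
 return ScalarReal((double) sum);
')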

Thanks

Henrik


