[Rd] sum(..., na.rm=FALSE): Summing over NA_real_ values much more expensive than non-NAs for na.rm=FALSE? Hmm...

Sat Jun 6 06:28:48 CEST 2015

I should summarize what I've learned after my more recent post on this:

With 'long double', it is extremely expensive to keep "track of"
non-finites (NA, NaN and Inf) when doing basic arithmetics.  The most
extreme cost is seen when the first value is non-finite, because there
is penalty added in all of the following iterations.

EXAMPLE:

## All finite values
> x <- rep(0.1, 1e8)
> system.time(sum(x, narm=FALSE))
   user  system elapsed
   0.28    0.00    0.28

## First value is NA (maximum penalty)
> y <- x; x[1] <- NA
> system.time(sum(y, narm=FALSE))
   user  system elapsed
  23.47    0.00   23.93   (sic!)

## Last value is NA (minimum penalty)
> z <- x; x[length(z)] <- NA
> system.time(sum(z, narm=FALSE))
   user  system elapsed
   0.35    0.00    0.34

## Silly, but proof of concept by summing rev(y)
## such that NA ends up last
> system.time(sum(rev(y), narm=FALSE))
   user  system elapsed
   2.40    0.85    3.31

This penalty of having non-finite values is only observed with 'long
double', but not with 'double' (see previous message).

WORKAROUND:
To workaround this penalty, one can test the running result for being
non-finite and return early, e.g. if na.rm=FALSE then as soon as the
running sum is NA_REAL, return NA_REAL.  Unfortunately, the cost for
testing for NA_REAL is expensive, so the penality of doing every
iteration would be too expensive.  A better approach is to test for
early stopping at some interval.  For example:

> sum4 <- inline::cfunction(sig=c(x="double", narm="logical"), body='
#define LDOUBLE long double
 double *x_ = REAL(x);
 int narm_ = asLogical(narm);
 int n = length(x);
 LDOUBLE sum = 0.0;
 for (R_xlen_t i = 0; i < n; i++) {
   /* If narm=FALSE, always sum */
   if (!narm_ || !ISNAN(x_[i])) sum += x_[i];
   /* Early stopping check when sum non-finite */
   if (!R_FINITE(sum)) break;
 }
 return ScalarReal((double)sum);
')

## See previous email for non-early stopping version
> system.time(sum3(x, narm=FALSE))
   user  system elapsed
   1.05    0.00    1.05

## Early stopping every iteration adds quite a bit
## of overhead if never used
> system.time(sum4(x, narm=FALSE))
   user  system elapsed
   1.79    0.00    1.86

## Early stopping once in a while
> sum5 <- inline::cfunction(sig=c(x="double", narm="logical"), body='
#define LDOUBLE long double
 double *x_ = REAL(x);
 int narm_ = asLogical(narm);
 int n = length(x);
 LDOUBLE sum = 0.0;
 for (R_xlen_t i = 0; i < n; i++) {
   /* If narm=FALSE, always sum */
   if (!narm_ || !ISNAN(x_[i])) sum += x_[i];
   /* Early stopping check when sum non-finite */
   if (i % 1048576 == 0 && !R_FINITE(sum)) break;
 }
 return ScalarReal((double)sum);
')

## Low extra cost even without non-finites
> system.time(sum5(x, narm=FALSE))
   user  system elapsed
   1.05    0.00    1.06

## ...and can still do early stopping
> system.time(sum5(y, narm=FALSE))
   user  system elapsed
      0       0       0

/Henrik

On Sun, May 31, 2015 at 6:08 PM, Henrik Bengtsson
<henrik.bengtsson at ucsf.edu> wrote:
> This is a great example how you cannot figure it out after spending
> two hours troubleshooting, but a few minutes after you post to
> R-devel, it's just jumps to you (is there a word for this other than
> "impatient"?);
>
> Let me answer my own question.  The discrepancy between my sum2() code
> and the internal code for base::sum() is that the latter uses LDOUBLE
> = long double (on some system it's only double, cf.
> https://github.com/wch/r-source/blob/trunk/src/nmath/nmath.h#L28-L33),
> whereas my sum2() code uses double.  So using long double, I can
> reproduce the penalty of having NA_real_ with na.rm=FALSE;
>
>> sum3 <- inline::cfunction(sig=c(x="double", narm="logical"), body='
> #define LDOUBLE long double
>  double *x_ = REAL(x);
>  int narm_ = asLogical(narm);
>  int n = length(x);
>  LDOUBLE sum = 0.0;
>  for (R_xlen_t i = 0; i < n; i++) {
>    if (!narm_ || !ISNAN(x_[i])) sum += x_[i];
>  }
>  return ScalarReal((double)sum);
> ')
>
>> x <- rep(0, 1e8)
>> stopifnot(typeof(x) == "double")
>> system.time(sum3(x, narm=FALSE))
>    user  system elapsed
>    0.40    0.00    0.44
>> y <- rep(NA_real_, 1e8)
>> stopifnot(typeof(y) == "double")
>> system.time(sum3(y, narm=FALSE))
>    user  system elapsed
>    9.80    0.00    9.84
>> z <- x; z[length(z)/2] <- NA_real_
>> stopifnot(typeof(z) == "double")
>> system.time(sum3(z, narm=FALSE))
>    user  system elapsed
>    4.49    0.00    4.50
>
> This might even be what the following comment refers to:
>
> /* Required by C99 but might be slow */
> #ifdef HAVE_LONG_DOUBLE
> # define LDOUBLE long double
> #else
> # define LDOUBLE double
> #endif
>
> So now I should rephrase my question: Is there away to avoid this
> penalty when using 'long double'?  Is this something the compiler can
> be clever about, or is the only solution to not use 'long double'?
>
> /Henrik
>
> On Sun, May 31, 2015 at 5:02 PM, Henrik Bengtsson
> <henrik.bengtsson at ucsf.edu> wrote:
>> I'm observing that base::sum(x, na.rm=FALSE) for typeof(x) == "double"
>> is much more time consuming when there are missing values versus when
>> there are not.  I'm observing this on both Window and Linux, but it's
>> quite surprising to me.  Currently, my main suspect is settings in on
>> how R was built.  The second suspect is my brain.  I hope that someone
>> can clarify the below results and confirm or not whether they see the
>> same.  Note, this is for "doubles", so I'm not expecting
>> early-stopping as for "integers" (where testing for NA is cheap).
>>
>> On R 3.2.0, on Windows (using the official CRAN builds), on Linux
>> (local built), and on OS X (official AT&T builds), I get:
>>
>>> x <- rep(0, 1e8)
>>> stopifnot(typeof(x) == "double")
>>> system.time(sum(x, na.rm=FALSE))
>>    user  system elapsed
>>    0.19    0.01    0.20
>>
>>> y <- rep(NA_real_, 1e8)
>>> stopifnot(typeof(y) == "double")
>>> system.time(sum(y, na.rm=FALSE))
>>    user  system elapsed
>>    9.54    0.00    9.55
>>
>>> z <- x; z[length(z)/2] <- NA_real_
>>> stopifnot(typeof(z) == "double")
>>> system.time(sum(z, na.rm=FALSE))
>>    user  system elapsed
>>    4.49    0.00    4.51
>>
>> Following the source code, I'm pretty sure the code
>> (https://github.com/wch/r-source/blob/trunk/src/main/summary.c#L112-L128)
>> performing the calculation is:
>>
>> static Rboolean rsum(double *x, R_xlen_t n, double *value, Rboolean narm)
>> {
>>   LDOUBLE s = 0.0;
>>   Rboolean updated = FALSE;
>>   for (R_xlen_t i = 0; i < n; i++) {
>>     if (!narm || !ISNAN(x[i])) {
>>       if(!updated) updated = TRUE;
>>         s += x[i];
>>     }
>>   }
>>   if(s > DBL_MAX) *value = R_PosInf;
>>   else if (s < -DBL_MAX) *value = R_NegInf;
>>   else *value = (double) s;
>>   return updated;
>> }
>>
>> In other words, when na.rm=FALSE, that inner for loop:
>>
>>   for (R_xlen_t i = 0; i < n; i++) {
>>     if (!narm || !ISNAN(x[i])) {
>>       if(!updated) updated = TRUE;
>>         s += x[i];
>>     }
>>   }
>>
>> should effectively become (because !ISNAN(x[i]) "does not make a difference"):
>>
>>   for (R_xlen_t i = 0; i < n; i++) {
>>     if (!narm) {
>>       if(!updated) updated = TRUE;
>>         s += x[i];
>>     }
>>   }
>>
>> That is, sum(x, na.rm=FALSE) basically spends time on `s += x[i]`.
>> Now, I have always been under impression that summing with NA:s is
>> *not* more expensive that summing over regular (double) values, which
>> is confirmed by the below example, but the above benchmarking
>> disagree.  It looks like there is a big overhead keeping track of the
>> sum `s` being NA, which is supported by the fact that summing over 'z'
>> is costs half of 'y'.
>>
>> Now, I *cannot* reproduce the above using the following 'inline' example:
>>
>>> sum2 <- inline::cfunction(sig=c(x="double", narm="logical"), body='
>>  double *x_ = REAL(x);
>>  int narm_ = asLogical(narm);
>>  int n = length(x);
>>  double sum = 0;
>>  for (R_xlen_t i = 0; i < n; i++) {
>>    if (!narm_ || !ISNAN(x_[i])) sum += x_[i];
>>  }
>>  return ScalarReal(sum);
>> ')
>>
>>> x <- rep(0, 1e8)
>>> stopifnot(typeof(x) == "double")
>>> system.time(sum2(x, narm=FALSE))
>>    user  system elapsed
>>    0.16    0.00    0.16
>>
>>> y <- rep(NA_real_, 1e8)
>>> stopifnot(typeof(y) == "double")
>>> system.time(sum2(y, narm=FALSE))
>>    user  system elapsed
>>    0.16    0.00    0.15
>>
>>> z <- x; z[length(z)/2] <- NA_real_
>>> stopifnot(typeof(z) == "double")
>>> system.time(sum2(z, narm=FALSE))
>>    user  system elapsed
>>    0.16    0.00    0.15
>>
>> This is why I suspect it's related to how R was configured when it was
>> built. What's going on? Can someone please bring some light on this?
>>
>> Thanks
>>
>> Henrik