[Rd] sum(..., na.rm=FALSE): Summing over NA_real_ values much more expensive than non-NAs for na.rm=FALSE? Hmm...
Henrik Bengtsson
henrik.bengtsson at ucsf.edu
Sat Jun 6 06:28:48 CEST 2015
I should summarize what I've learned after my recent posts on this:
With 'long double', it is extremely expensive to keep track of
non-finite values (NA, NaN and Inf) when doing basic arithmetic. The
most extreme cost is seen when the first value is non-finite, because
the penalty is then paid in all of the following iterations.
EXAMPLE:
## All finite values
> x <- rep(0.1, 1e8)
> system.time(sum(x, na.rm=FALSE))
user system elapsed
0.28 0.00 0.28
## First value is NA (maximum penalty)
> y <- x; y[1] <- NA
> system.time(sum(y, na.rm=FALSE))
user system elapsed
23.47 0.00 23.93 (sic!)
## Last value is NA (minimum penalty)
> z <- x; z[length(z)] <- NA
> system.time(sum(z, na.rm=FALSE))
user system elapsed
0.35 0.00 0.34
## Silly, but a proof of concept: summing rev(y) so that
## the NA ends up last (rev() allocates a copy, hence some overhead)
> system.time(sum(rev(y), na.rm=FALSE))
user system elapsed
2.40 0.85 3.31
This penalty for non-finite values is only observed with 'long
double', not with 'double' (see previous message).
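For what it's worth, the effect can be reproduced outside of R. Below
is a minimal standalone C sketch of mine (not code from R itself); my
understanding, though I have not verified it, is that on x86 the
penalty comes from the x87 FPU, which is used for 'long double'
arithmetic and handles NaN operands through a slow microcode path:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

#define N 100000000  /* 1e8 elements, as in the R examples above */

int main(void) {
  double *x = malloc(N * sizeof *x);
  for (size_t i = 0; i < N; i++) x[i] = 0.1;
  x[0] = NAN;  /* worst case: the accumulator is NaN from the start */

  /* Accumulate in plain double */
  clock_t t0 = clock();
  double s = 0.0;
  for (size_t i = 0; i < N; i++) s += x[i];
  clock_t t1 = clock();

  /* Accumulate in long double */
  long double ls = 0.0L;
  for (size_t i = 0; i < N; i++) ls += x[i];
  clock_t t2 = clock();

  printf("double:      %.2fs (%f)\n", (t1 - t0) / (double)CLOCKS_PER_SEC, s);
  printf("long double: %.2fs (%f)\n", (t2 - t1) / (double)CLOCKS_PER_SEC, (double)ls);
  free(x);
  return 0;
}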
WORKAROUND:
To work around this penalty, one can test the running result for
being non-finite and return early, e.g. if na.rm=FALSE then as soon
as the running sum is NA_REAL, return NA_REAL. Unfortunately, testing
for NA_REAL is itself expensive, so the penalty of doing it in every
iteration would be too high. A better approach is to test for early
stopping at some interval. For example:
> sum4 <- inline::cfunction(sig=c(x="double", narm="logical"), body='
#define LDOUBLE long double
double *x_ = REAL(x);
int narm_ = asLogical(narm);
R_xlen_t n = xlength(x);
LDOUBLE sum = 0.0;
for (R_xlen_t i = 0; i < n; i++) {
  /* If narm=FALSE, always sum */
  if (!narm_ || !ISNAN(x_[i])) sum += x_[i];
  /* Early-stopping check: bail out once the sum is non-finite */
  if (!R_FINITE(sum)) break;
}
return ScalarReal((double)sum);
')
## See previous email for the version without early stopping
> system.time(sum3(x, narm=FALSE))
user system elapsed
1.05 0.00 1.05
## Checking for early stopping in every iteration adds
## quite a bit of overhead when it never triggers
> system.time(sum4(x, narm=FALSE))
user system elapsed
1.79 0.00 1.86
## Early stopping once in a while
> sum5 <- inline::cfunction(sig=c(x="double", narm="logical"), body='
#define LDOUBLE long double
double *x_ = REAL(x);
int narm_ = asLogical(narm);
R_xlen_t n = xlength(x);
LDOUBLE sum = 0.0;
for (R_xlen_t i = 0; i < n; i++) {
  /* If narm=FALSE, always sum */
  if (!narm_ || !ISNAN(x_[i])) sum += x_[i];
  /* Early-stopping check, but only every 2^20 iterations */
  if (i % 1048576 == 0 && !R_FINITE(sum)) break;
}
return ScalarReal((double)sum);
')
## Low extra cost even without non-finites
> system.time(sum5(x, narm=FALSE))
user system elapsed
1.05 0.00 1.06
## ...and can still do early stopping
> system.time(sum5(y, narm=FALSE))
user system elapsed
0 0 0
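A note on the interval choice: 1048576 = 2^20 is a power of two, so
the 'i % 1048576 == 0' test compiles down to a cheap bitmask check,
and the cost of R_FINITE() is amortized over ~10^6 additions. The
trade-off is that up to 2^20 additions can be wasted after the first
non-finite value before the loop stops, which is negligible at these
vector lengths.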
/Henrik
On Sun, May 31, 2015 at 6:08 PM, Henrik Bengtsson
<henrik.bengtsson at ucsf.edu> wrote:
> This is a great example of how you can fail to figure something out
> after two hours of troubleshooting, and then, a few minutes after
> posting to R-devel, the answer just jumps out at you (is there a
> word for this other than "impatient"?).
>
> Let me answer my own question. The discrepancy between my sum2()
> code and the internal code for base::sum() is that the latter uses
> LDOUBLE = long double (on some systems it is only double, cf.
> https://github.com/wch/r-source/blob/trunk/src/nmath/nmath.h#L28-L33),
> whereas my sum2() code uses double. So using long double, I can
> reproduce the penalty of having NA_real_ with na.rm=FALSE:
>
>> sum3 <- inline::cfunction(sig=c(x="double", narm="logical"), body='
> #define LDOUBLE long double
> double *x_ = REAL(x);
> int narm_ = asLogical(narm);
> R_xlen_t n = xlength(x);
> LDOUBLE sum = 0.0;
> for (R_xlen_t i = 0; i < n; i++) {
>   if (!narm_ || !ISNAN(x_[i])) sum += x_[i];
> }
> return ScalarReal((double)sum);
> ')
>
>> x <- rep(0, 1e8)
>> stopifnot(typeof(x) == "double")
>> system.time(sum3(x, narm=FALSE))
> user system elapsed
> 0.40 0.00 0.44
>> y <- rep(NA_real_, 1e8)
>> stopifnot(typeof(y) == "double")
>> system.time(sum3(y, narm=FALSE))
> user system elapsed
> 9.80 0.00 9.84
>> z <- x; z[length(z)/2] <- NA_real_
>> stopifnot(typeof(z) == "double")
>> system.time(sum3(z, narm=FALSE))
> user system elapsed
> 4.49 0.00 4.50
>
> This might even be what the following comment refers to:
>
> /* Required by C99 but might be slow */
> #ifdef HAVE_LONG_DOUBLE
> # define LDOUBLE long double
> #else
> # define LDOUBLE double
> #endif
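>
> As a quick sanity check, one can probe what 'long double' actually
> is on a given platform with a throwaway C program like this (a
> sketch of mine, not part of R):
>
> #include <stdio.h>
> #include <float.h>
>
> int main(void) {
>     printf("sizeof(double)=%zu, sizeof(long double)=%zu\n",
>            sizeof(double), sizeof(long double));
>     printf("DBL_MANT_DIG=%d, LDBL_MANT_DIG=%d\n",
>            DBL_MANT_DIG, LDBL_MANT_DIG);
>     return 0;
> }
>
> On typical x86 Linux builds this reports an 80-bit extended type
> (LDBL_MANT_DIG=64); where the two types are identical, the LDOUBLE
> path is plain double and the penalty should not appear.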
>
> So now I should rephrase my question: Is there a way to avoid this
> penalty when using 'long double'? Is this something the compiler can
> be clever about, or is the only solution not to use 'long double'?
>
> /Henrik
>
> On Sun, May 31, 2015 at 5:02 PM, Henrik Bengtsson
> <henrik.bengtsson at ucsf.edu> wrote:
>> I'm observing that base::sum(x, na.rm=FALSE) for typeof(x) == "double"
>> is much more time consuming when there are missing values than when
>> there are not. I'm observing this on both Windows and Linux, and it's
>> quite surprising to me. Currently, my main suspect is the settings
>> with which R was built. The second suspect is my brain. I hope that
>> someone can clarify the results below and confirm whether or not they
>> see the same. Note, this is for "doubles", so I'm not expecting early
>> stopping as for "integers" (where testing for NA is cheap; see the
>> sketch below).
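>>
>> For comparison, here is a sketch (mine, not R's actual code) of why
>> the integer case allows cheap early stopping: R encodes integer NA
>> as INT_MIN, so detecting it is a plain comparison:
>>
>> #include <limits.h>
>> #include <math.h>
>>
>> /* Stop at the first NA when na.rm=FALSE; the NA test is a cheap
>>    integer comparison, unlike ISNAN() on doubles. */
>> static double isum_sketch(const int *x, long n) {
>>     double s = 0.0;
>>     for (long i = 0; i < n; i++) {
>>         if (x[i] == INT_MIN) return NAN;  /* NA_INTEGER sentinel */
>>         s += x[i];
>>     }
>>     return s;
>> }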
>>
>> On R 3.2.0, on Windows (using the official CRAN builds), on Linux
>> (locally built), and on OS X (official AT&T builds), I get:
>>
>>> x <- rep(0, 1e8)
>>> stopifnot(typeof(x) == "double")
>>> system.time(sum(x, na.rm=FALSE))
>> user system elapsed
>> 0.19 0.01 0.20
>>
>>> y <- rep(NA_real_, 1e8)
>>> stopifnot(typeof(y) == "double")
>>> system.time(sum(y, na.rm=FALSE))
>> user system elapsed
>> 9.54 0.00 9.55
>>
>>> z <- x; z[length(z)/2] <- NA_real_
>>> stopifnot(typeof(z) == "double")
>>> system.time(sum(z, na.rm=FALSE))
>> user system elapsed
>> 4.49 0.00 4.51
>>
>> Following the source code, I'm pretty sure the code
>> (https://github.com/wch/r-source/blob/trunk/src/main/summary.c#L112-L128)
>> performing the calculation is:
>>
>> static Rboolean rsum(double *x, R_xlen_t n, double *value, Rboolean narm)
>> {
>>     LDOUBLE s = 0.0;
>>     Rboolean updated = FALSE;
>>     for (R_xlen_t i = 0; i < n; i++) {
>>         if (!narm || !ISNAN(x[i])) {
>>             if(!updated) updated = TRUE;
>>             s += x[i];
>>         }
>>     }
>>     if(s > DBL_MAX) *value = R_PosInf;
>>     else if (s < -DBL_MAX) *value = R_NegInf;
>>     else *value = (double) s;
>>     return updated;
>> }
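>>
>> (Side note: the LDOUBLE accumulator is also what makes the final
>> s > DBL_MAX / s < -DBL_MAX checks meaningful: with the extended
>> range of 'long double' the running sum can exceed double range
>> without turning into Inf, and is only mapped to +/-Inf at the end.)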
>>
>> In other words, when na.rm=FALSE, that inner for loop:
>>
>> for (R_xlen_t i = 0; i < n; i++) {
>>     if (!narm || !ISNAN(x[i])) {
>>         if(!updated) updated = TRUE;
>>         s += x[i];
>>     }
>> }
>>
>> should effectively become (since narm is FALSE, !narm is TRUE and
>> !ISNAN(x[i]) is never even evaluated, due to short-circuiting):
>>
>> for (R_xlen_t i = 0; i < n; i++) {
>>     if (!narm) {
>>         if(!updated) updated = TRUE;
>>         s += x[i];
>>     }
>> }
>>
>> That is, sum(x, na.rm=FALSE) basically spends its time on `s += x[i]`.
>> Now, I have always been under the impression that summing with NAs is
>> *not* more expensive than summing over regular (double) values, which
>> is confirmed by the below example, but the above benchmarking
>> disagrees. It looks like there is a big overhead in keeping track of
>> the sum `s` being NA, which is supported by the fact that summing over
>> 'z' costs half of 'y' (the accumulator is NA for only the second half
>> of 'z').
>>
>> Now, I *cannot* reproduce the above using the following 'inline' example:
>>
>>> sum2 <- inline::cfunction(sig=c(x="double", narm="logical"), body='
>> double *x_ = REAL(x);
>> int narm_ = asLogical(narm);
>> R_xlen_t n = xlength(x);
>> double sum = 0;
>> for (R_xlen_t i = 0; i < n; i++) {
>>   if (!narm_ || !ISNAN(x_[i])) sum += x_[i];
>> }
>> return ScalarReal(sum);
>> ')
>>
>>> x <- rep(0, 1e8)
>>> stopifnot(typeof(x) == "double")
>>> system.time(sum2(x, narm=FALSE))
>> user system elapsed
>> 0.16 0.00 0.16
>>
>>> y <- rep(NA_real_, 1e8)
>>> stopifnot(typeof(y) == "double")
>>> system.time(sum2(y, narm=FALSE))
>> user system elapsed
>> 0.16 0.00 0.15
>>
>>> z <- x; z[length(z)/2] <- NA_real_
>>> stopifnot(typeof(z) == "double")
>>> system.time(sum2(z, narm=FALSE))
>> user system elapsed
>> 0.16 0.00 0.15
>>
>> This is why I suspect it's related to how R was configured when it
>> was built. What's going on? Can someone please shed some light on
>> this?
>>
>> Thanks
>>
>> Henrik