[Rd] sum(..., na.rm=FALSE): Summing over NA_real_ values much more expensive than non-NAs for na.rm=FALSE? Hmm...
Henrik Bengtsson
henrik.bengtsson at ucsf.edu
Mon Jun 1 03:08:50 CEST 2015
This is a great example how you cannot figure it out after spending
two hours troubleshooting, but a few minutes after you post to
R-devel, it's just jumps to you (is there a word for this other than
"impatient"?);
Let me answer my own question. The discrepancy between my sum2() code
and the internal code for base::sum() is that the latter uses LDOUBLE
= long double (on some system it's only double, cf.
https://github.com/wch/r-source/blob/trunk/src/nmath/nmath.h#L28-L33),
whereas my sum2() code uses double. So using long double, I can
reproduce the penalty of having NA_real_ with na.rm=FALSE;
> sum3 <- inline::cfunction(sig=c(x="double", narm="logical"), body='
#define LDOUBLE long double
double *x_ = REAL(x);
int narm_ = asLogical(narm);
int n = length(x);
LDOUBLE sum = 0.0;
for (R_xlen_t i = 0; i < n; i++) {
if (!narm_ || !ISNAN(x_[i])) sum += x_[i];
}
return ScalarReal((double)sum);
')
> x <- rep(0, 1e8)
> stopifnot(typeof(x) == "double")
> system.time(sum3(x, narm=FALSE))
user system elapsed
0.40 0.00 0.44
> y <- rep(NA_real_, 1e8)
> stopifnot(typeof(y) == "double")
> system.time(sum3(y, narm=FALSE))
user system elapsed
9.80 0.00 9.84
> z <- x; z[length(z)/2] <- NA_real_
> stopifnot(typeof(z) == "double")
> system.time(sum3(z, narm=FALSE))
user system elapsed
4.49 0.00 4.50
This might even be what the following comment refers to:
/* Required by C99 but might be slow */
#ifdef HAVE_LONG_DOUBLE
# define LDOUBLE long double
#else
# define LDOUBLE double
#endif
So now I should rephrase my question: Is there away to avoid this
penalty when using 'long double'? Is this something the compiler can
be clever about, or is the only solution to not use 'long double'?
/Henrik
On Sun, May 31, 2015 at 5:02 PM, Henrik Bengtsson
<henrik.bengtsson at ucsf.edu> wrote:
> I'm observing that base::sum(x, na.rm=FALSE) for typeof(x) == "double"
> is much more time consuming when there are missing values versus when
> there are not. I'm observing this on both Window and Linux, but it's
> quite surprising to me. Currently, my main suspect is settings in on
> how R was built. The second suspect is my brain. I hope that someone
> can clarify the below results and confirm or not whether they see the
> same. Note, this is for "doubles", so I'm not expecting
> early-stopping as for "integers" (where testing for NA is cheap).
>
> On R 3.2.0, on Windows (using the official CRAN builds), on Linux
> (local built), and on OS X (official AT&T builds), I get:
>
>> x <- rep(0, 1e8)
>> stopifnot(typeof(x) == "double")
>> system.time(sum(x, na.rm=FALSE))
> user system elapsed
> 0.19 0.01 0.20
>
>> y <- rep(NA_real_, 1e8)
>> stopifnot(typeof(y) == "double")
>> system.time(sum(y, na.rm=FALSE))
> user system elapsed
> 9.54 0.00 9.55
>
>> z <- x; z[length(z)/2] <- NA_real_
>> stopifnot(typeof(z) == "double")
>> system.time(sum(z, na.rm=FALSE))
> user system elapsed
> 4.49 0.00 4.51
>
> Following the source code, I'm pretty sure the code
> (https://github.com/wch/r-source/blob/trunk/src/main/summary.c#L112-L128)
> performing the calculation is:
>
> static Rboolean rsum(double *x, R_xlen_t n, double *value, Rboolean narm)
> {
> LDOUBLE s = 0.0;
> Rboolean updated = FALSE;
> for (R_xlen_t i = 0; i < n; i++) {
> if (!narm || !ISNAN(x[i])) {
> if(!updated) updated = TRUE;
> s += x[i];
> }
> }
> if(s > DBL_MAX) *value = R_PosInf;
> else if (s < -DBL_MAX) *value = R_NegInf;
> else *value = (double) s;
> return updated;
> }
>
> In other words, when na.rm=FALSE, that inner for loop:
>
> for (R_xlen_t i = 0; i < n; i++) {
> if (!narm || !ISNAN(x[i])) {
> if(!updated) updated = TRUE;
> s += x[i];
> }
> }
>
> should effectively become (because !ISNAN(x[i]) "does not make a difference"):
>
> for (R_xlen_t i = 0; i < n; i++) {
> if (!narm) {
> if(!updated) updated = TRUE;
> s += x[i];
> }
> }
>
> That is, sum(x, na.rm=FALSE) basically spends time on `s += x[i]`.
> Now, I have always been under impression that summing with NA:s is
> *not* more expensive that summing over regular (double) values, which
> is confirmed by the below example, but the above benchmarking
> disagree. It looks like there is a big overhead keeping track of the
> sum `s` being NA, which is supported by the fact that summing over 'z'
> is costs half of 'y'.
>
> Now, I *cannot* reproduce the above using the following 'inline' example:
>
>> sum2 <- inline::cfunction(sig=c(x="double", narm="logical"), body='
> double *x_ = REAL(x);
> int narm_ = asLogical(narm);
> int n = length(x);
> double sum = 0;
> for (R_xlen_t i = 0; i < n; i++) {
> if (!narm_ || !ISNAN(x_[i])) sum += x_[i];
> }
> return ScalarReal(sum);
> ')
>
>> x <- rep(0, 1e8)
>> stopifnot(typeof(x) == "double")
>> system.time(sum2(x, narm=FALSE))
> user system elapsed
> 0.16 0.00 0.16
>
>> y <- rep(NA_real_, 1e8)
>> stopifnot(typeof(y) == "double")
>> system.time(sum2(y, narm=FALSE))
> user system elapsed
> 0.16 0.00 0.15
>
>> z <- x; z[length(z)/2] <- NA_real_
>> stopifnot(typeof(z) == "double")
>> system.time(sum2(z, narm=FALSE))
> user system elapsed
> 0.16 0.00 0.15
>
> This is why I suspect it's related to how R was configured when it was
> built. What's going on? Can someone please bring some light on this?
>
> Thanks
>
> Henrik
More information about the R-devel
mailing list