[Rd] Question re: NA, NaNs in R

Kevin Ushey kevinushey at gmail.com
Mon Feb 10 19:43:50 CET 2014


Also, similarly, to clarify, should there be _one_ unique bit pattern
for R's NA_REAL, or two? Because I see (for a function hex that
produces the hex representation of a number):

> hex(NA_real_)
[1] "7FF00000000007A2"
> hex(NA_real_+1)
[1] "7FF80000000007A2"
> hex(NaN)
[1] "7FF8000000000000"

This is with 64-bit R (R-devel r64910 on OS X Mavericks). I also
noticed, in a conversation with Arun (co-author of data.table), that:

On 32-bit R-2.15.3:

NA: 7ff80000000007a2
NaN: 7ff8000000000000

On 64-bit R-2.15.3:
NA: 7ff00000000007a2
NaN: 7ff8000000000000

Notice that the leading four hex digits are 7ff0, rather than 7ff8, for
64-bit R. Is this intentional?

Thanks,
Kevin

(function follows:)

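#include <stdio.h>
#include <string.h>
#include <R.h>
#include <Rinternals.h>

// returns the hex representation of the first element of a double vector;
// assumes double and unsigned long long are both 8 bytes
SEXP hex(SEXP x) {

  if (TYPEOF(x) != REALSXP || XLENGTH(x) < 1)
    error("'x' must be a double vector of length >= 1");

  // 8 bytes, each represented by 2 hex chars, plus the terminating NUL,
  // so need a str with 16+1 slots
  char buf[sizeof(unsigned long long) * 2 + 1];

  // copy the bytes rather than casting the pointer, which would break
  // C's aliasing rules
  unsigned long long xx;
  memcpy(&xx, REAL(x), sizeof(xx));
  snprintf(buf, sizeof(buf), "%016llX", xx);

  SEXP output = PROTECT(allocVector(STRSXP, 1));
  SET_STRING_ELT(output, 0, mkChar(buf));
  UNPROTECT(1);
  return output;
}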
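
For anyone who wants to look at bit patterns without building against
R, here is a minimal standalone sketch of the same idea in plain C. The
names double_to_bits and bits_to_hex are made up for illustration (they
are not part of R's API), and the sketch assumes a 64-bit IEEE 754
double.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Copy the object representation of a double into a 64-bit integer.
   memcpy sidesteps the aliasing problems of casting a double pointer
   to an integer pointer. */
static uint64_t double_to_bits(double x) {
    uint64_t bits;
    assert(sizeof bits == sizeof x);  /* assumption: double is 64 bits */
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* Write 16 upper-case hex digits plus the terminating NUL into out. */
static void bits_to_hex(uint64_t bits, char out[17]) {
    snprintf(out, 17, "%016" PRIX64, bits);
}
```

With a 17-byte buffer buf, bits_to_hex(double_to_bits(1.0), buf) leaves
"3FF0000000000000" in buf.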

On Mon, Feb 10, 2014 at 10:07 AM, Duncan Murdoch
<murdoch.duncan at gmail.com> wrote:
> On 10/02/2014 10:21 AM, Tim Hesterberg wrote:
>>
>> This isn't quite what you were asking, but might inform your choice.
>>
>> R doesn't try to maintain the distinction between NA and NaN when
>> doing calculations, e.g.:
>> > NA + NaN
>> [1] NA
>> > NaN + NA
>> [1] NaN
>> So for the aggregate package, I didn't attempt to treat them differently.
>
>
> This looks like a bug to me.  In 32 bit 3.0.2 and R-patched I see
>
>
>> NA + NaN
> [1] NA
>> NaN + NA
> [1] NA
>
> This seems more reasonable to me.  NA should propagate.  (I can see an
> argument for NaN for the answer here, as I can't think of any possible
> non-missing value that would give anything else when added to NaN, but the
> answer should not depend on the order of operands.)
>
> However, I get the same as you in 64 bit 3.0.2.  All calculations I've shown
> are on 64 bit Windows 7.
>
> Duncan Murdoch
>
>
>
>>
>> The aggregate package is available at
>> http://www.timhesterberg.net/r-packages
>>
>> Here is the inst/doc/missingValues.txt file from that package:
>>
>> --------------------------------------------------
>> Copyright 2012 Google Inc. All Rights Reserved.
>> Author: Tim Hesterberg <rocket at google.com>
>> Distributed under GPL 2 or later.
>>
>>
>>         Handling of missing values and not-a-numbers.
>>
>>
>> Here I'll note how this package handles missing values.
>> I do it the way R handles them, rather than the more strict way that S+
>> does.
>>
>> First, for terminology,
>>    NaN = "not-a-number", e.g. the result of 0/0
>>    NA  = "missing value" or "true missing value", e.g. survey non-response
>>    xx  = I'll use this for the union of those, or "missing value of any kind".
>>
>> For background, at the hardware level there is an IEEE standard that
>> specifies that certain bit patterns are NaN, and specifies that
>> operations involving an NaN result in another NaN.
>>
>> That standard doesn't say anything about missing values, which are
>> important in statistics.
>>
>> So what R and S+ do is to pick one of the bit patterns and declare
>> that to be a NA.  In other words, the NA bit pattern is a subset of
>> the NaN bit patterns.
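
To make the "NA is a subset of NaN" point concrete, here is a rough
sketch that classifies a raw 64-bit pattern. It assumes IEEE 754
doubles and uses the low-word value 1954 (0x7A2) that R_IsNA checks in
the code quoted further down. The names bits_are_nan and bits_are_na
are hypothetical, not R functions.

```c
#include <stdbool.h>
#include <stdint.h>

#define EXP_MASK    UINT64_C(0x7FF0000000000000)  /* 11 exponent bits */
#define FRAC_MASK   UINT64_C(0x000FFFFFFFFFFFFF)  /* 52 fraction bits */
#define LOW_WORD    UINT64_C(0x00000000FFFFFFFF)  /* low 32 bits */
#define NA_LOW_WORD UINT64_C(1954)                /* 0x7A2 */

/* NaN: exponent all ones and a non-zero fraction. */
static bool bits_are_nan(uint64_t bits) {
    return (bits & EXP_MASK) == EXP_MASK && (bits & FRAC_MASK) != 0;
}

/* NA in the R sense: a NaN whose low 32 bits equal 1954.
   The quiet bit (the top fraction bit) is deliberately ignored. */
static bool bits_are_na(uint64_t bits) {
    return bits_are_nan(bits) && (bits & LOW_WORD) == NA_LOW_WORD;
}
```

Under this check both 7FF00000000007A2 and 7FF80000000007A2 count as
NA, while 7FF8000000000000 is a NaN that is not NA. A byte-for-byte
comparison against a single stored pattern would accept only one of
the first two.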
>>
>> At the user level, the reverse seems to hold.
>> You can assign either NA or NaN to an object.
>> But:
>>         is.na(x) returns TRUE for both
>>         is.nan(x) returns TRUE for NaN and FALSE for NA
>> Based on that, you'd think that NaN is a subset of NA.
>> To tell whether something is a true missing value do:
>>         (is.na(x) & !is.nan(x))
>>
>> The S+ convention is that any operation involving NA results in an NA;
>> otherwise any operation involving NaN results in NaN.
>>
>> The R convention is that any operation involving xx results in an xx;
>> a missing value of any kind results in another missing value of any
>> kind.  R considers NA and NaN equivalent for testing purposes:
>>         all.equal(NA_real_, NaN)
>> gives TRUE.
>>
>> Some R functions follow the S+ convention, e.g. the Math2 functions
>> in src/main/arithmetic.c use this macro:
>> #define if_NA_Math2_set(y,a,b)                          \
>>         if      (ISNA (a) || ISNA (b)) y = NA_REAL;     \
>>         else if (ISNAN(a) || ISNAN(b)) y = R_NaN;
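
As an illustration of what that macro's precedence rule buys you, here
is a standalone sketch of an addition that follows the S+ convention.
It is not R's code: make_nan_with_low_word, is_na and add_na_first are
hypothetical helpers, and for portability the sketch builds its NA from
the quiet pattern 7FF80000000007A2 rather than 7FF00000000007A2.

```c
#include <math.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Build a quiet NaN whose low 32 bits carry the given value. */
static double make_nan_with_low_word(uint32_t low) {
    uint64_t bits = UINT64_C(0x7FF8000000000000) | low;
    double x;
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* NA test in the style of R_IsNA: a NaN whose low word is 1954. */
static bool is_na(double x) {
    uint64_t bits;
    if (!isnan(x)) return false;
    memcpy(&bits, &x, sizeof bits);
    return (bits & UINT64_C(0xFFFFFFFF)) == 1954;
}

/* Addition under the S+ convention: NA beats NaN, in either order. */
static double add_na_first(double a, double b) {
    if (is_na(a) || is_na(b)) return make_nan_with_low_word(1954);
    if (isnan(a) || isnan(b)) return make_nan_with_low_word(0);
    return a + b;
}
```

Because the NA check runs before the NaN check, the result no longer
depends on operand order. The cost is two extra tests per operation.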
>>
>> Other R functions, like the basic arithmetic operations +-/*^,
>> do not (search for PLUSOP in src/main/arithmetic.c).
>> They just let the hardware do the calculations.
>> As a result, you can get odd results like
>> > is.nan(NA_real_ + NaN)
>> [1] FALSE
>> > is.nan(NaN + NA_real_)
>> [1] TRUE
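
To see what the hardware does with the payload on a given machine, a
small probe along these lines may help. It is only a sketch: the
function names are made up, and which operand's payload survives is
left to the CPU and compiler, so no particular output is claimed.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 64-bit pattern as a double, and back. */
static double from_bits(uint64_t bits) {
    double x;
    memcpy(&x, &bits, sizeof x);
    return x;
}

static uint64_t to_bits(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* Add two doubles with a plain '+', with no NA/NaN special-casing.
   volatile keeps the compiler from folding the sum at compile time. */
static uint64_t hardware_sum_bits(double a, double b) {
    volatile double va = a, vb = b;
    double sum = va + vb;
    return to_bits(sum);
}
```

Comparing hardware_sum_bits(na, nan) with hardware_sum_bits(nan, na),
where na = from_bits(0x7FF80000000007A2) and
nan = from_bits(0x7FF8000000000000), shows whether the low word 1954
survives in each order.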
>>
>> The R help files help(is.na) and help(is.nan) suggest that
>> computations involving NA and NaN are indeterminate.
>>
>> It is faster to use the R convention; most operations are just
>> handled by the hardware, without extra work.
>>
>> In cases like sum(x, na.rm=TRUE), the help file specifies that both NA
>> and NaN are removed.
>>
>>
>>
>>
>> >There is one NA but multiple NaNs.
>> >
>> >And please re-read 'man memcmp': your cast is wrong.
>> >
>> >On 10/02/2014 06:52, Kevin Ushey wrote:
>> >> Hi R-devel,
>> >>
>> >> I have a question about the differentiation between NA and NaN values
>> >> as implemented in R. In arithmetic.c, we have
>> >>
>> >> int R_IsNA(double x)
>> >> {
>> >>      if (isnan(x)) {
>> >>          ieee_double y;
>> >>          y.value = x;
>> >>          return (y.word[lw] == 1954);
>> >>      }
>> >>      return 0;
>> >> }
>> >>
>> >> ieee_double is just used for type punning so we can check the final
>> >> bits and see if they're equal to 1954; if they are, x is NA, if
>> >> they're not, x is NaN (as defined for R_IsNaN).
>> >>
>> >> My question is -- I can see a substantial increase in speed (on my
>> >> computer, in certain cases) if I replace this check with
>> >>
>> >> int R_IsNA(double x)
>> >> {
>> >>      return memcmp(
>> >>          (char*)(&x),
>> >>          (char*)(&NA_REAL),
>> >>          sizeof(double)
>> >>      ) == 0;
>> >> }
>> >>
>> >> IIUC, there is only one bit pattern used to encode R NA values, so
>> >> this should be safe. But I would like to be sure:
>> >>
>> >> Is there any guarantee that the different functions in R would return
>> >> NA as identical to the bit pattern defined for NA_REAL, for a given
>> >> architecture? Similarly for NaN value(s) and R_NaN?
>> >>
>> >> My guess is that it is possible some functions used internally by R
>> >> might encode NaN values differently; ie, setting the lower word to a
>> >> value different than 1954 (hence being NaN, but potentially not
>> >> identical to R_NaN), or perhaps this is architecture-dependent.
>> >> However, NA should be one specific bit pattern (?). And, I wonder if
>> >> there is any guarantee that the different functions used in R would
>> >> return an NaN value as identical to R_NaN (which appears to be the
>> >> 'IEEE NaN')?
>> >>
>> >> (interested parties can see + run a simple benchmark from the gist at
>> >> https://gist.github.com/kevinushey/8911432)
>> >>
>> >> Thanks,
>> >> Kevin
>> >>
>> >> ______________________________________________
>> >> R-devel at r-project.org mailing list
>> >> https://stat.ethz.ch/mailman/listinfo/r-devel
>> >>
>> >
>> >
>> >--
>> >Brian D. Ripley,                  ripley at stats.ox.ac.uk
>> >Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
>> >University of Oxford,             Tel:  +44 1865 272861 (self)
>> >1 South Parks Road,                     +44 1865 272866 (PA)
>> >Oxford OX1 3TG, UK                Fax:  +44 1865 272595
>>
>
>


