[Rd] Question re: NA, NaNs in R
Duncan Murdoch
murdoch.duncan at gmail.com
Mon Feb 10 19:07:51 CET 2014
On 10/02/2014 10:21 AM, Tim Hesterberg wrote:
> This isn't quite what you were asking, but might inform your choice.
>
> R doesn't try to maintain the distinction between NA and NaN when
> doing calculations, e.g.:
> > NA + NaN
> [1] NA
> > NaN + NA
> [1] NaN
> So for the aggregate package, I didn't attempt to treat them differently.
This looks like a bug to me. In 32 bit 3.0.2 and R-patched I see
> NA + NaN
[1] NA
> NaN + NA
[1] NA
This seems more reasonable to me. NA should propagate. (I can see an
argument for NaN for the answer here, as I can't think of any possible
non-missing value that would give anything else when added to NaN, but
the answer should not depend on the order of operands.)
However, I get the same as you in 64 bit 3.0.2. All calculations I've
shown are on 64 bit Windows 7.
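
A minimal C sketch for checking the order dependence outside R. It assumes doubles are 64-bit IEEE values and that NA_real_ is the NaN with high word 0x7FF00000 and low word 1954; the helper names (bits_of, from_bits, make_na, make_nan, is_na, add) are made up for illustration and are not R's API. Which operand's payload survives when both operands are NaNs is left to the hardware, so the sketch does not assume either answer.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Copy the bits of a double into a 64-bit integer (memcpy avoids aliasing issues). */
static uint64_t bits_of(double x) {
    uint64_t u;
    assert(sizeof u == sizeof x);
    memcpy(&u, &x, sizeof u);
    return u;
}

/* Build a double from a raw 64-bit pattern. */
static double from_bits(uint64_t u) {
    double x;
    memcpy(&x, &u, sizeof x);
    return x;
}

/* Assumption: NA is the NaN with high word 0x7FF00000 and low word 1954 (0x7A2). */
static double make_na(void) { return from_bits(0x7FF00000000007A2ULL); }

/* A plain quiet NaN with an all-zero payload. */
static double make_nan(void) { return from_bits(0x7FF8000000000000ULL); }

/* Same idea as R_IsNA: a NaN whose low 32-bit word is 1954. */
static int is_na(double x) {
    return isnan(x) && (uint32_t)(bits_of(x) & 0xFFFFFFFFu) == 1954u;
}

/* volatile operands keep the compiler from folding the sum at compile time,
   so the addition is actually carried out by the hardware. */
static double add(volatile double a, volatile double b) { return a + b; }
```

Comparing is_na(add(make_na(), make_nan())) with is_na(add(make_nan(), make_na())) on a given build shows whether the result depends on operand order there.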
Duncan Murdoch
>
> The aggregate package is available at
> http://www.timhesterberg.net/r-packages
>
> Here is the inst/doc/missingValues.txt file from that package:
>
> --------------------------------------------------
> Copyright 2012 Google Inc. All Rights Reserved.
> Author: Tim Hesterberg <rocket at google.com>
> Distributed under GPL 2 or later.
>
>
> Handling of missing values and not-a-numbers.
>
>
> Here I'll note how this package handles missing values.
> I do it the way R handles them, rather than the more strict way that S+ does.
>
> First, for terminology,
> NaN = "not-a-number", e.g. the result of 0/0
> NA = "missing value" or "true missing value", e.g. survey non-response
> xx = I'll use this for the union of those, or "missing value of any kind".
>
> For background, at the hardware level there is an IEEE standard that
> specifies that certain bit patterns are NaN, and specifies that
> operations involving an NaN result in another NaN.
>
> That standard doesn't say anything about missing values, which are
> important in statistics.
>
> So what R and S+ do is to pick one of the bit patterns and declare
> that to be an NA. In other words, the NA bit pattern is a subset of
> the NaN bit patterns.
>
> At the user level, the reverse seems to hold.
> You can assign either NA or NaN to an object.
> But:
> is.na(x) returns TRUE for both
> is.nan(x) returns TRUE for NaN and FALSE for NA
> Based on that, you'd think that NaN is a subset of NA.
> To tell whether something is a true missing value do:
> (is.na(x) & !is.nan(x))
>
> The S+ convention is that any operation involving NA results in an NA;
> otherwise any operation involving NaN results in NaN.
>
> The R convention is that any operation involving xx results in an xx;
> a missing value of any kind results in another missing value of any
> kind. R considers NA and NaN equivalent for testing purposes:
> all.equal(NA_real_, NaN)
> gives TRUE.
>
> Some R functions follow the S+ convention, e.g. the Math2 functions
> in src/main/arithmetic.c use this macro:
>   #define if_NA_Math2_set(y,a,b) \
>       if      (ISNA (a) || ISNA (b)) y = NA_REAL; \
>       else if (ISNAN(a) || ISNAN(b)) y = R_NaN;
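
A hedged sketch of what the macro's rule would look like if applied to plain addition, so that the result no longer depends on operand order. This is not how R's arithmetic is implemented; add_na_first, na_value and is_na are hypothetical names, and the NA bit pattern (high word 0x7FF00000, low word 1954) is an assumption stated in the comments.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Assumption: NA is the NaN with high word 0x7FF00000 and low word 1954. */
static double na_value(void) {
    const uint64_t u = 0x7FF00000000007A2ULL;
    double x;
    assert(sizeof x == sizeof u);
    memcpy(&x, &u, sizeof x);
    return x;
}

/* Same idea as R_IsNA: a NaN whose low 32-bit word is 1954. */
static int is_na(double x) {
    uint64_t u;
    memcpy(&u, &x, sizeof u);
    return isnan(x) && (uint32_t)(u & 0xFFFFFFFFu) == 1954u;
}

/* Hypothetical helper: addition under the "NA wins, then NaN" rule.
   The explicit checks run before the hardware ever sees the operands. */
static double add_na_first(double a, double b) {
    if (is_na(a) || is_na(b)) return na_value();
    if (isnan(a) || isnan(b)) return (double)NAN;
    return a + b;
}
```

The two extra branches on every call are the cost that the "let the hardware do it" approach avoids.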
>
> Other R functions, like the basic arithmetic operations +-/*^,
> do not (search for PLUSOP in src/main/arithmetic.c).
> They just let the hardware do the calculations.
> As a result, you can get odd results like
> > is.nan(NA_real_ + NaN)
> [1] FALSE
> > is.nan(NaN + NA_real_)
> [1] TRUE
>
> The R help files help(is.na) and help(is.nan) suggest that
> computations involving NA and NaN are indeterminate.
>
> It is faster to use the R convention; most operations are just
> handled by the hardware, without extra work.
>
> In cases like sum(x, na.rm=TRUE), the help file specifies that both NA
> and NaN are removed.
>
>
>
>
> >There is one NA but multiple NaNs.
> >
> >And please re-read 'man memcmp': your cast is wrong.
> >
> >On 10/02/2014 06:52, Kevin Ushey wrote:
> >> Hi R-devel,
> >>
> >> I have a question about the differentiation between NA and NaN values
> >> as implemented in R. In arithmetic.c, we have
> >>
> >> int R_IsNA(double x)
> >> {
> >>     if (isnan(x)) {
> >>         ieee_double y;
> >>         y.value = x;
> >>         return (y.word[lw] == 1954);
> >>     }
> >>     return 0;
> >> }
> >>
> >> ieee_double is just used for type punning so we can check the final
> >> bits and see if they're equal to 1954; if they are, x is NA, if
> >> they're not, x is NaN (as defined for R_IsNaN).
> >>
> >> My question is -- I can see a substantial increase in speed (on my
> >> computer, in certain cases) if I replace this check with
> >>
> >> int R_IsNA(double x)
> >> {
> >>     return memcmp(
> >>         (char*)(&x),
> >>         (char*)(&NA_REAL),
> >>         sizeof(double)
> >>     ) == 0;
> >> }
> >>
> >> IIUC, there is only one bit pattern used to encode R NA values, so
> >> this should be safe. But I would like to be sure:
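
A small C sketch of how the two checks can disagree. It assumes NA_REAL is the 64-bit pattern 0x7FF00000000007A2 (low word 1954), and that a value with the same low word but a different high word, for example with the quiet bit set after passing through an arithmetic operation, should still count as NA under the low-word test. The names set_bits, low_word, is_na_lowword and is_na_memcmp are illustrative only. Values are passed by pointer so that no bits change on the way in, and memcmp takes const void * arguments, so no cast is needed.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Store a raw 64-bit pattern into *out. */
static void set_bits(double *out, uint64_t u) {
    assert(sizeof *out == sizeof u);
    memcpy(out, &u, sizeof *out);
}

/* Low 32 bits of the double's representation. */
static uint32_t low_word(const double *x) {
    uint64_t u;
    memcpy(&u, x, sizeof u);
    return (uint32_t)(u & 0xFFFFFFFFu);
}

/* Low-word test in the style of the original R_IsNA. */
static int is_na_lowword(const double *x) {
    return isnan(*x) && low_word(x) == 1954u;
}

/* Whole-value comparison against one reference NA pattern. */
static int is_na_memcmp(const double *x, const double *na) {
    return memcmp(x, na, sizeof *x) == 0;
}
```

With na set to 0x7FF00000000007A2 and a second value set to 0x7FF80000000007A2, the low-word test accepts both while the memcmp test accepts only the first.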
> >>
> >> Is there any guarantee that the different functions in R would return
> >> NA as identical to the bit pattern defined for NA_REAL, for a given
> >> architecture? Similarly for NaN value(s) and R_NaN?
> >>
> >> My guess is that it is possible some functions used internally by R
> >> might encode NaN values differently; ie, setting the lower word to a
> >> value different than 1954 (hence being NaN, but potentially not
> >> identical to R_NaN), or perhaps this is architecture-dependent.
> >> However, NA should be one specific bit pattern (?). And, I wonder if
> >> there is any guarantee that the different functions used in R would
> >> return an NaN value as identical to R_NaN (which appears to be the
> >> 'IEEE NaN')?
> >>
> >> (interested parties can see + run a simple benchmark from the gist at
> >> https://gist.github.com/kevinushey/8911432)
> >>
> >> Thanks,
> >> Kevin
> >>
> >> ______________________________________________
> >> R-devel at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-devel
> >>
> >
> >
> >--
> >Brian D. Ripley, ripley at stats.ox.ac.uk
> >Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
> >University of Oxford, Tel: +44 1865 272861 (self)
> >1 South Parks Road, +44 1865 272866 (PA)
> >Oxford OX1 3TG, UK Fax: +44 1865 272595
>