[Rd] table(exclude = NULL) always includes NA

Martin Maechler maechler at stat.math.ethz.ch
Fri Aug 12 10:12:01 CEST 2016


>>>>> Suharto Anggono Suharto Anggono via R-devel <r-devel at r-project.org>
>>>>>     on Thu, 11 Aug 2016 16:19:49 +0000 writes:

    > I stand corrected. The part "If set to 'NULL', it implies
    > 'useNA="always"'." is even in the documentation in R
    > 2.8.0. It was my fault not to check carefully.  I wonder,
    > why "always" was chosen for 'useNA' for exclude = NULL.

Me too.  "ifany" would seem more logical, and I am considering
changing to that as a 2nd step (if the 1st step, below, proves to
be feasible).
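
To make the difference between the two settings concrete, here is a small sketch using the vectors a and b from the example quoted further down. It passes 'useNA' explicitly and leaves 'exclude' untouched, so it should not depend on how the 'exclude = NULL' default is eventually resolved. The object names t_ifany and t_always are only illustrative.

```r
a <- c(1, 1, 2, 2, NA, 3); b <- c(2, 1, 1, 1, 1, 1)

# "ifany": an NA entry is added only for variables that actually contain NA
t_ifany  <- table(a, b, useNA = "ifany")
# "always": an NA entry is added for every variable, even with a zero count
t_always <- table(a, b, useNA = "always")

dim(t_ifany)    # 4 rows (1, 2, 3, <NA>) by 2 columns (1, 2)
dim(t_always)   # 4 rows by 3 columns: 'b' gains an all-zero <NA> column
```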

    > Why is exclude = NULL so special? What about another
    > 'exclude' of length zero, like character(0) (not c(),
    > because c() is NULL)? I thought that, too. But then, I
    > have no opinion about making it general.

As mentioned, I entirely agree with that {and you are right
about c() !!}.
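
The point about c() can be checked directly at the prompt. The following is only a small sketch with illustrative variable names (null_from_c, empty_chr); it shows that c() is NULL while character(0) is a zero-length vector that is not NULL, and that neither contains NA.

```r
null_from_c <- c()
empty_chr   <- character(0)

is.null(null_from_c)      # TRUE: c() with no arguments is NULL
is.null(empty_chr)        # FALSE: a zero-length character vector is not NULL
length(empty_chr) == 0L   # TRUE: it is nevertheless of length zero

# Neither contains NA, so a test based on 'NA %in% exclude' treats both alike
NA %in% null_from_c       # FALSE
NA %in% empty_chr         # FALSE
```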

    > It fits my expectation to override 'useNA' only if the
    > user doesn't explicitly specify 'useNA'.

    > Thank you for looking into this.

you are welcome.
As a first step, I plan to commit the change to (*)

 useNA <- if (missing(useNA) && !missing(exclude) && !(NA %in% exclude)) "always"

as proposed yesterday,  and I'll eventually see / be notified
about the effect in CRAN space.

--
(*) slightly more efficiently, I'll be using match() directly instead of %in%
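
For illustration only, the decision rule above can be tried out in isolation. The helper choose_useNA() below is hypothetical and is not the source of table(); in particular, falling back to match.arg() when the condition is false is an assumption of this sketch. It uses match() in place of %in%, as the footnote suggests.

```r
# Hypothetical stand-in for the relevant part of table(); not the actual source.
choose_useNA <- function(exclude, useNA = c("no", "ifany", "always")) {
  # match(NA, exclude, nomatch = 0L) == 0L is the same test as !(NA %in% exclude)
  if (missing(useNA) && !missing(exclude) &&
      match(NA, exclude, nomatch = 0L) == 0L) {
    "always"
  } else {
    match.arg(useNA)   # assumed fallback: ordinary argument matching
  }
}

choose_useNA()                                 # "no"
choose_useNA(exclude = NULL)                   # "always"
choose_useNA(exclude = character(0))           # "always"
choose_useNA(exclude = c(NA, NaN))             # "no"
choose_useNA(exclude = NULL, useNA = "ifany")  # "ifany"
```

Under this rule, any 'exclude' that does not contain NA, of length zero or not, switches the default, while an explicitly given 'useNA' is always respected.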

    > My points:
    > Could the R 2.7.2 behavior of table(<non-factor>, exclude = NULL) be brought back? But the R 3.3.1 behavior has been in R since version 2.8.0, which is rather long.

you are right... but then, the places / cases where the behavior
would change back should be quite rare.

    > If not, I suggest changing summary(<logical>).
    > --------------------------------------------

Thank you for your feedback, Suharto!
Martin

    > On Thu, 11/8/16, Martin Maechler <maechler at stat.math.ethz.ch> wrote:
    > 
    >  Subject: Re: [Rd] table(exclude = NULL) always includes NA
    > 
    >  Cc: "Martin Maechler" <maechler at stat.math.ethz.ch>
    >  Date: Thursday, 11 August, 2016, 12:39 AM
    > 
    > >>>>> Martin Maechler <maechler at stat.math.ethz.ch>
    > >>>>>     on Tue, 9 Aug 2016 15:35:41 +0200 writes:
    > 
    > >>>>> Suharto Anggono Suharto Anggono via R-devel <r-devel at r-project.org>
    > >>>>>     on Sun, 7 Aug 2016 15:32:19 +0000 writes:
    > 
    > > > This is an example from https://stat.ethz.ch/pipermail/r-help/2007-May/132573.html .
    > > 
    > > > With R 2.7.2:
    > > 
    > > > > a <- c(1, 1, 2, 2, NA, 3); b <- c(2, 1, 1, 1, 1, 1)
    > > > > table(a, b, exclude = NULL)
    > > >       b
    > > > a      1 2
    > > >   1    1 1
    > > >   2    2 0
    > > >   3    1 0
    > > >   <NA> 1 0
    > > 
    > > > With R 3.3.1:
    > > 
    > > > > a <- c(1, 1, 2, 2, NA, 3); b <- c(2, 1, 1, 1, 1, 1)
    > > > > table(a, b, exclude = NULL)
    > > >       b
    > > > a      1 2 <NA>
    > > >   1    1 1    0
    > > >   2    2 0    0
    > > >   3    1 0    0
    > > >   <NA> 1 0    0
    > > > > table(a, b, useNA = "ifany")
    > > >       b
    > > > a      1 2
    > > >   1    1 1
    > > >   2    2 0
    > > >   3    1 0
    > > >   <NA> 1 0
    > > > > table(a, b, exclude = NULL, useNA = "ifany")
    > > >       b
    > > > a      1 2 <NA>
    > > >   1    1 1    0
    > > >   2    2 0    0
    > > >   3    1 0    0
    > > >   <NA> 1 0    0
    > > 
    > > > For the example, in R 3.3.1, the result of 'table' with
    > > > exclude = NULL includes NA even if NA is not present. It is
    > > > different from R 2.7.2, that comes from factor(exclude = NULL), 
    > > > that includes NA only if NA is present.
    > > 
    > > I agree that this (R 3.3.1 behavior) seems undesirable and looks
    > > wrong, and the old (<= 2.7.2) behavior for  table(a,b,
    > > exclude=NULL) seems desirable to me.
    > > 
    > > 
    > > > From R 3.3.1 help on 'table', in "Details" section:
    > > > 'useNA' controls if the table includes counts of 'NA' values: the allowed values correspond to never, only if the count is positive and even for zero counts.  This is overridden by specifying 'exclude = NULL'.
    > > 
    > > > Specifying 'exclude = NULL' overrides 'useNA' to what value? The documentation doesn't say. Looking at the code of function 'table', the value is "always".
    > > 
    > > Yes, it should be documented what happens for this case,
    > > (but read on ...)
    > 
    > and it is *not* true that the documentation does not say, since
    > 2013, it has contained
    > 
    > exclude: levels to remove for all factors in ‘...’.  If set to ‘NULL’,
    >           it implies ‘useNA = "always"’.  See ‘Details’ for its
    >           interpretation for non-factor arguments.
    > 
    > 
    > > > For the example, in R 3.3.1, the result like in R 2.7.2 can be obtained with useNA = "ifany" and 'exclude' unspecified.
    > > 
    > > Yes.  What should we do?
    > > I currently think that we'd want to change the line
    > > 
    > >      useNA <- if (!missing(exclude) && is.null(exclude)) "always"
    > > 
    > > to
    > > 
    > >      useNA <- if (!missing(exclude) && is.null(exclude)) "ifany" # was "always"
    > > 
    > > 
    > > which would not even contradict documentation, as indeed you
    > > mentioned above, the exact action here had not been documented.
    > 
    > The last part ("which ..") above is wrong, as mentioned earlier.
    > 
    > The above change entails behaviour which looks better to me;
    > however, the change *is* "against the current documentation",
    > and after experimentation (a "complete factorial design" of
    > argument settings), I'm not entirely happy with the result.  One reason
    > is that   'exclude = NULL'  and  (e.g.)   'exclude = c()'
    > are (still) handled differently: From a usual interpretation,
    > both should mean 
    >   "do not exclude any factor entries (and levels) from tabulation"
    > but one of the two changes the default of 'useNA' and the other
    > does not.   If we want a change anyway (and have to update the doc),
    > it could be "more logical"  to replace the line above by
    > 
    >    useNA <- if (missing(useNA) && !missing(exclude) && !(NA %in% exclude)) "always"
    > 
    > notably, replacing 'useNA' *only* if it has not been specified,
    > which seems much closer to "typically expected" behavior.
    > 
    > 
    > 
    > 
    > > The change above at least does not break any of the standard R
    > > tests ('make check-all', i.e., including the recommended
    > > packages), which for me confirms that it may be "what is
    > > best"...
    > > 
    > > ----
    > > 
    > > Thank you for mentioning the important consequence for summary(<logical>).
    > > It can help give insight into what a "probably best" behavior should
    > > be for these cases of table().
    > > 
    > > Martin Maechler,
    > > ETH Zurich
    > > 
    > > > The result of 'summary' of a logical vector is affected. As mentioned in http://stackoverflow.com/questions/26775501/r-dropping-nas-in-logical-column-levels , in the code of function 'summary.default', for logical, table(object, exclude = NULL) is used.
    > > 
    > > > With R 2.7.2:
    > > 
    > > > > log <- c(NA, logical(4), NA, !logical(2), NA)
    > > > > summary(log)
    > > >    Mode   FALSE    TRUE    NA's
    > > > logical       4       2       3
    > > > > summary(log[!is.na(log)])
    > > >    Mode   FALSE    TRUE
    > > > logical       4       2
    > > > > summary(TRUE)
    > > >    Mode    TRUE
    > > > logical       1
    > > 
    > > > With R 3.3.1:
    > > 
    > > > > log <- c(NA, logical(4), NA, !logical(2), NA)
    > > > > summary(log)
    > > >    Mode   FALSE    TRUE    NA's
    > > > logical       4       2       3
    > > > > summary(log[!is.na(log)])
    > > >    Mode   FALSE    TRUE    NA's
    > > > logical       4       2       0
    > > > > summary(TRUE)
    > > >    Mode    TRUE    NA's
    > > > logical       1       0
    > > 
    > > > In R 3.3.1, "NA's" is always in the result of 'summary' of a logical vector. It is unlike 'summary' of a numeric vector.
    > > > On the other hand, in R 3.3.1, FALSE is not in the result of 'summary' of a logical vector that doesn't contain FALSE.
    > > 
    > > > I prefer the result of 'summary' of a logical vector like in R 2.7.2, or, alternatively, the result that always includes all possible values: FALSE, TRUE, NA.
    > > 
    > > I tend to agree, and strongly prefer the 'R(<=2.7.2)'-behavior
    > > for table() {and hence summary(<logical>)}.
    >


