[Rd] table(exclude = NULL) always includes NA
Martin Maechler
maechler at stat.math.ethz.ch
Fri Aug 12 10:12:01 CEST 2016
>>>>> Suharto Anggono Suharto Anggono via R-devel <r-devel at r-project.org>
>>>>> on Thu, 11 Aug 2016 16:19:49 +0000 writes:
> I stand corrected. The part "If set to 'NULL', it implies
> 'useNA="always"'." is even in the documentation in R
> 2.8.0. It was my fault not to check carefully. I wonder,
> why "always" was chosen for 'useNA' for exclude = NULL.
Me too. "ifany" would seem more logical, and I am considering
changing to that as a 2nd step (if the 1st step, below, proves to
be feasible).
> Why is exclude = NULL so special? What about another
> 'exclude' of length zero, like character(0) (not c(),
> because c() is NULL)? I thought that, too. But then, I
> have no opinion about making it general.
As mentioned, I entirely agree with that {and you are right
about c() !!}.
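
A small sketch of the point about length-zero 'exclude' values (the name
'empty_excludes' is only for illustration): c() is literally NULL, while
character(0) is just as empty but is not caught by an is.null() test. A test
on the contents, such as NA %in% exclude, treats all of them alike.

```r
# Three ways of saying "exclude nothing"; c() with no arguments returns NULL.
empty_excludes <- list(null = NULL, c_call = c(), chr0 = character(0))

# All three have length zero ...
lens <- sapply(empty_excludes, length)

# ... but only the first two are NULL, so an is.null() test misses character(0).
is_null <- sapply(empty_excludes, is.null)

# A test on the contents does not distinguish them: none contains NA.
has_na <- sapply(empty_excludes, function(e) NA %in% e)

print(rbind(lens, is_null, has_na))
```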
> It fits my expectation to override 'useNA' only if the
> user doesn't explicitly specify 'useNA'.
> Thank you for looking into this.
you are welcome.
As a first step, I plan to commit the change to (*)
useNA <- if (missing(useNA) && !missing(exclude) && !(NA %in% exclude)) "always"
as proposed yesterday, and I'll eventually see / be notified
about the effect in CRAN space.
--
(*) slightly more efficiently, I'll be using match() directly instead of %in%
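
To see what the proposed condition does for various argument combinations,
here is a minimal sketch. The helper 'resolve_useNA' is hypothetical and is
not the code of table(); in particular, the fallback to match.arg() in the
'else' branch is an assumption made only to keep the example self-contained.

```r
# Hypothetical helper: mimics only how the proposed rule would choose 'useNA'.
resolve_useNA <- function(exclude, useNA = c("no", "ifany", "always")) {
  if (missing(useNA) && !missing(exclude) && !(NA %in% exclude)) {
    "always"
  } else {
    match.arg(useNA)  # assumption: otherwise use the value given (or the default)
  }
}

r_default  <- resolve_useNA()                                # neither argument given
r_null     <- resolve_useNA(exclude = NULL)                  # exclude without NA
r_chr0     <- resolve_useNA(exclude = character(0))          # same as NULL here
r_explicit <- resolve_useNA(exclude = NULL, useNA = "ifany") # explicit useNA wins
r_with_na  <- resolve_useNA(exclude = c(NA, NaN))            # exclude contains NA

# x %in% table is match(x, table, nomatch = 0L) > 0L, so calling match()
# directly gives the same answer.
same_test <- identical(NA %in% NULL, match(NA, NULL, nomatch = 0L) > 0L)
```

Under these assumptions, 'useNA' is overridden only when the user supplied
'exclude' without NA and did not supply 'useNA', and NULL and character(0)
are treated the same way.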
> My points:
> Could R 2.7.2 behavior of table(<non-factor>, exclude = NULL) be brought back? But the R 3.3.1 behavior has been in R since version 2.8.0, which is rather long.
you are right... but then, the places / cases where the behavior
would change back should be quite rare.
> If not, I suggest changing summary(<logical>).
> --------------------------------------------
Thank you for your feedback, Suharto!
Martin
> On Thu, 11/8/16, Martin Maechler <maechler at stat.math.ethz.ch> wrote:
>
> Subject: Re: [Rd] table(exclude = NULL) always includes NA
>
> @r-project.org
> Cc: "Martin Maechler" <maechler at stat.math.ethz.ch>
> Date: Thursday, 11 August, 2016, 12:39 AM
>
> >>>>> Martin Maechler <maechler at stat.math.ethz.ch>
> >>>>> on Tue, 9 Aug 2016 15:35:41 +0200 writes:
>
> >>>>> Suharto Anggono Suharto Anggono via R-devel <r-devel at r-project.org>
> >>>>> on Sun, 7 Aug 2016 15:32:19 +0000 writes:
>
> > > This is an example from https://stat.ethz.ch/pipermail/r-help/2007-May/132573.html .
> >
> > > With R 2.7.2:
> >
> > > > a <- c(1, 1, 2, 2, NA, 3); b <- c(2, 1, 1, 1, 1, 1)
> > > > table(a, b, exclude = NULL)
> > >       b
> > > a      1 2
> > >   1    1 1
> > >   2    2 0
> > >   3    1 0
> > >   <NA> 1 0
> >
> > > With R 3.3.1:
> >
> > > > a <- c(1, 1, 2, 2, NA, 3); b <- c(2, 1, 1, 1, 1, 1)
> > > > table(a, b, exclude = NULL)
> > >       b
> > > a      1 2 <NA>
> > >   1    1 1    0
> > >   2    2 0    0
> > >   3    1 0    0
> > >   <NA> 1 0    0
> > > > table(a, b, useNA = "ifany")
> > >       b
> > > a      1 2
> > >   1    1 1
> > >   2    2 0
> > >   3    1 0
> > >   <NA> 1 0
> > > > table(a, b, exclude = NULL, useNA = "ifany")
> > >       b
> > > a      1 2 <NA>
> > >   1    1 1    0
> > >   2    2 0    0
> > >   3    1 0    0
> > >   <NA> 1 0    0
> >
> > > For the example, in R 3.3.1, the result of 'table' with
> > > exclude = NULL includes NA even if NA is not present. It is
> > > different from R 2.7.2, that comes from factor(exclude = NULL),
> > > that includes NA only if NA is present.
> >
> > I agree that this (R 3.3.1 behavior) seems undesirable and looks
> > wrong, and the old (<= 2.7.2) behavior for table(a,b,
> > exclude=NULL) seems desirable to me.
> >
> >
> > > From R 3.3.1 help on 'table', in "Details" section:
> > > 'useNA' controls if the table includes counts of 'NA' values: the allowed values correspond to never, only if the count is positive and even for zero counts. This is overridden by specifying 'exclude = NULL'.
> >
> > > Specifying 'exclude = NULL' overrides 'useNA' to what value? The documentation doesn't say. Looking at the code of function 'table', the value is "always".
> >
> > Yes, it should be documented what happens for this case,
> > (but read on ...)
>
> and it is *not* true that the documentation does not say: since
> 2013, it has contained
>
> exclude: levels to remove for all factors in ‘...’. If set to ‘NULL’,
> it implies ‘useNA = "always"’. See ‘Details’ for its
> interpretation for non-factor arguments.
>
>
> > > For the example, in R 3.3.1, the result like in R 2.7.2 can be obtained with useNA = "ifany" and 'exclude' unspecified.
> >
> > Yes. What should we do?
> > I currently think that we'd want to change the line
> >
> > useNA <- if (!missing(exclude) && is.null(exclude)) "always"
> >
> > to
> >
> > useNA <- if (!missing(exclude) && is.null(exclude)) "ifany" # was "always"
> >
> >
> > which would not even contradict documentation, as indeed you
> > mentioned above, the exact action here had not been documented.
>
> The last part ("which ..") above is wrong, as mentioned earlier.
>
> The above change entails behaviour which looks better to me;
> however, the change *is* "against the current documentation".
> And after experimentation (a "complete factorial design" of
> argument settings), I'm not entirely happy with the result. One reason
> is that 'exclude = NULL' and (e.g.) 'exclude = c()'
> are (still) handled differently: From a usual interpretation,
> both should mean
> "do not exclude any factor entries (and levels) from tabulation"
> but one of the two changes the default of 'useNA' and the other
> does not. If we want a change anyway (and have to update the doc),
> it could be "more logical" to replace the line above by
>
> useNA <- if (missing(useNA) && !missing(exclude) && !(NA %in% exclude)) "always"
>
> notably, replacing 'useNA' *only* if it has not been specified,
> which seems much closer to "typically expected" behavior.
>
>
>
>
> > The change above at least does not break any of the standard R
> > tests ('make check-all', i.e., including the recommended
> > packages), which for me confirms that it may be "what is
> > best"...
> >
> > ----
> >
> > Thank you for mentioning the important consequence for summary(<logical>).
> > They can help give insight into what a "probably best" behavior should
> > be for these cases of table().
> >
> > Martin Maechler,
> > ETH Zurich
> >
> > > The result of 'summary' of a logical vector is affected. As mentioned in http://stackoverflow.com/questions/26775501/r-dropping-nas-in-logical-column-levels , in the code of function 'summary.default', for logical, table(object, exclude = NULL) is used.
> >
> > > With R 2.7.2:
> >
> > > > log <- c(NA, logical(4), NA, !logical(2), NA)
> > > > summary(log)
> > >    Mode   FALSE    TRUE    NA's
> > > logical       4       2       3
> > > > summary(log[!is.na(log)])
> > >    Mode   FALSE    TRUE
> > > logical       4       2
> > > > summary(TRUE)
> > >    Mode    TRUE
> > > logical       1
> >
> > > With R 3.3.1:
> >
> > > > log <- c(NA, logical(4), NA, !logical(2), NA)
> > > > summary(log)
> > >    Mode   FALSE    TRUE    NA's
> > > logical       4       2       3
> > > > summary(log[!is.na(log)])
> > >    Mode   FALSE    TRUE    NA's
> > > logical       4       2       0
> > > > summary(TRUE)
> > >    Mode    TRUE    NA's
> > > logical       1       0
> >
> > > In R 3.3.1, "NA's" is always in the result of 'summary' of a logical vector. It is unlike 'summary' of a numeric vector.
> > > On the other hand, in R 3.3.1, FALSE is not in the result of 'summary' of a logical vector that doesn't contain FALSE.
> >
> > > I prefer the result of 'summary' of a logical vector like in R 2.7.2, or, alternatively, the result that always includes all possible values: FALSE, TRUE, NA.
> >
> > I tend to agree, and strongly prefer the 'R(<=2.7.2)'-behavior
> > for table() {and hence summary(<logical>)}.
>