[Rd] RFC: tapply(*, ..., init.value = NA)
Martin Maechler
maechler at stat.math.ethz.ch
Thu Jan 26 11:42:27 CET 2017
Last week, we've talked here about "xtabs(), factors and NAs",
-> https://stat.ethz.ch/pipermail/r-devel/2017-January/073621.html
In the mean time, I've spent several hours on the issue
and also committed changes to R-devel "in two iterations".
In the case there is a *Left* hand side part to xtabs() formula,
see the help page example using 'esoph',
it uses tapply(..., FUN = sum) and
I now think there is a missing feature in tapply() there, which
I am proposing to change.
Look at a small example:
> D2 <- data.frame(n = gl(3,4), L = gl(6,2, labels=LETTERS[1:6]), N=3)[-c(1,5), ]; xtabs(~., D2)
, , N = 3
L
n A B C D E F
1 1 2 0 0 0 0
2 0 0 1 2 0 0
3 0 0 0 0 2 2
> DN <- D2; DN[1,"N"] <- NA; DN
n L N
2 1 A NA
3 1 B 3
4 1 B 3
6 2 C 3
7 2 D 3
8 2 D 3
9 3 E 3
10 3 E 3
11 3 F 3
12 3 F 3
> with(DN, tapply(N, list(n,L), FUN=sum))
A B C D E F
1 NA 6 NA NA NA NA
2 NA NA 3 6 NA NA
3 NA NA NA NA 6 6
>
and as you can see, the resulting matrix has NAs, all the same
NA_real_, but semantically of two different kinds:
1) at ["1", "A"], the NA comes from the NA in 'N'
2) all other NAs come from the fact that there is no such factor combination
*and* from the fact that tapply() uses
array(dim = .., dimnames = ...)
i.e., initializes the array with NAs (see definition of 'array').
My proposition is the following patch to tapply(), adding a new
option 'init.value':
-----------------------------------------------------------------------------
-tapply <- function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
+tapply <- function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE)
{
FUN <- if (!is.null(FUN)) match.fun(FUN)
if (!is.list(INDEX)) INDEX <- list(INDEX)
@@ -44,7 +44,7 @@
index <- as.logical(lengths(ans)) # equivalently, lengths(ans) > 0L
ans <- lapply(X = ans[index], FUN = FUN, ...)
if (simplify && all(lengths(ans) == 1L)) {
- ansmat <- array(dim = extent, dimnames = namelist)
+ ansmat <- array(init.value, dim = extent, dimnames = namelist)
ans <- unlist(ans, recursive = FALSE)
} else {
ansmat <- array(vector("list", prod(extent)),
-----------------------------------------------------------------------------
With that, I can set the initial value to '0' instead of array's
default of NA :
> with(DN, tapply(N, list(n,L), FUN=sum, init.value=0))
A B C D E F
1 NA 6 0 0 0 0
2 0 0 3 6 0 0
3 0 0 0 0 6 6
>
which now has 0 counts and NA as is desirable to be used inside
xtabs().
All fine... and would not be worth a posting to R-devel,
except for this:
The change will not be 100% back compatible -- by necessity: any new argument for
tapply() will make that argument name not available to be
specified (via '...') for 'FUN'. The new function would be
> str(tapply)
function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE)
where the '...' are passed FUN(), and with the new signature,
'init.value' then won't be passed to FUN "anymore" (compared to
R <= 3.3.x).
For that reason, we could use 'INIT.VALUE' instead (possibly decreasing
the probability the arg name is used in other functions).
Opinions?
Thank you in advance,
Martin
More information about the R-devel
mailing list