[Rd] RFC: tapply(*, ..., init.value = NA)

Fri Jan 27 16:20:18 CET 2017

If xtabs is enhanced then as.data.frame.table may also need to be
modified so that it continues to be usable as an inverse, at least to
the degree feasible.

On Thu, Jan 26, 2017 at 5:42 AM, Martin Maechler
<maechler at stat.math.ethz.ch> wrote:
> Last week, we've talked here about "xtabs(), factors and NAs",
>  ->  https://stat.ethz.ch/pipermail/r-devel/2017-January/073621.html
>
> In the mean time, I've spent several hours on the issue
> and also committed changes to R-devel "in two iterations".
>
> In the case there is a *Left* hand side part to xtabs() formula,
> see the help page example using 'esoph',
> it uses  tapply(...,  FUN = sum)   and
> I now think there is a missing feature in tapply() there, which
> I am proposing to change.
>
> Look at a small example:
>
>> D2 <- data.frame(n = gl(3,4), L = gl(6,2, labels=LETTERS[1:6]), N=3)[-c(1,5), ]; xtabs(~., D2)
> , , N = 3
>
>    L
> n   A B C D E F
>   1 1 2 0 0 0 0
>   2 0 0 1 2 0 0
>   3 0 0 0 0 2 2
>
>> DN <- D2; DN[1,"N"] <- NA; DN
>    n L  N
> 2  1 A NA
> 3  1 B  3
> 4  1 B  3
> 6  2 C  3
> 7  2 D  3
> 8  2 D  3
> 9  3 E  3
> 10 3 E  3
> 11 3 F  3
> 12 3 F  3
>> with(DN, tapply(N, list(n,L), FUN=sum))
>    A  B  C  D  E  F
> 1 NA  6 NA NA NA NA
> 2 NA NA  3  6 NA NA
> 3 NA NA NA NA  6  6
>>
>
> and as you can see, the resulting matrix has NAs, all the same
> NA_real_, but semantically of two different kinds:
>
> 1) at ["1", "A"], the  NA  comes from the NA in 'N'
> 2) all other NAs come from the fact that there is no such factor combination
>    *and* from the fact that tapply() uses
>
>    array(dim = .., dimnames = ...)
>
> i.e., initializes the array with NAs  (see definition of 'array').
>
> My proposition is the following patch to  tapply(), adding a new
> option 'init.value':
>
> -----------------------------------------------------------------------------
>
> -tapply <- function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
> +tapply <- function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE)
>  {
>      FUN <- if (!is.null(FUN)) match.fun(FUN)
>      if (!is.list(INDEX)) INDEX <- list(INDEX)
> @@ -44,7 +44,7 @@
>      index <- as.logical(lengths(ans))  # equivalently, lengths(ans) > 0L
>      ans <- lapply(X = ans[index], FUN = FUN, ...)
>      if (simplify && all(lengths(ans) == 1L)) {
> -       ansmat <- array(dim = extent, dimnames = namelist)
> +       ansmat <- array(init.value, dim = extent, dimnames = namelist)
>         ans <- unlist(ans, recursive = FALSE)
>      } else {
>         ansmat <- array(vector("list", prod(extent)),
>
> -----------------------------------------------------------------------------
>
> With that, I can set the initial value to '0' instead of array's
> default of NA :
>
>> with(DN, tapply(N, list(n,L), FUN=sum, init.value=0))
>    A B C D E F
> 1 NA 6 0 0 0 0
> 2  0 0 3 6 0 0
> 3  0 0 0 0 6 6
>>
>
> which now has 0 counts and NA  as is desirable to be used inside
> xtabs().
>
> All fine... and would not be worth a posting to R-devel,
> except for this:
>
> The change will not be 100% back compatible -- by necessity: any new argument for
> tapply() will make that argument name not available to be
> specified (via '...') for 'FUN'.  The new function would be
>
>> str(tapply)
> function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE)
>
> where the '...' are passed FUN(),  and with the new signature,
> 'init.value' then won't be passed to FUN  "anymore" (compared to
> R <= 3.3.x).
>
> For that reason, we could use   'INIT.VALUE' instead (possibly decreasing
> the probability the arg name is used in other functions).
>
>
> Opinions?
>
> Thank you in advance,
> Martin
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com