[Rd] RFC: tapply(*, ..., init.value = NA)
Martin Maechler
maechler at stat.math.ethz.ch
Fri Jan 27 18:01:22 CET 2017
>>>>> Suharto Anggono Suharto Anggono via R-devel <r-devel at r-project.org>
>>>>> on Fri, 27 Jan 2017 16:36:59 +0000 writes:
> The "no factor combination" case is distinguishable by 'tapply' with simplify=FALSE.
>> D2 <- data.frame(n = gl(3,4), L = gl(6,2, labels=LETTERS[1:6]), N=3)
>> D2 <- D2[-c(1,5), ]
>> DN <- D2; DN[1,"N"] <- NA
>> with(DN, tapply(N, list(n,L), FUN=sum, simplify=FALSE))
> A B C D E F
> 1 NA 6 NULL NULL NULL NULL
> 2 NULL NULL 3 6 NULL NULL
> 3 NULL NULL NULL NULL 6 6
Yes, I know that simplify=FALSE behaves differently, it returns
a list with dim & dimnames, sometimes also called a "list - matrix"
... and it *can* be used instead, but to be useful needs to be
post processed and that overall is somewhat inefficient and ugly.
> There is an old related discussion starting on https://stat.ethz.ch/pipermail/r-devel/2007-November/047338.html .
Thank you, indeed, for finding that. There Andrew Robinson did
raise the same issue, but his proposed solution was not much
back compatible and I think was primarily dismissed because of that.
Martin
> ----------------------------------
> Last week, we've talked here about "xtabs(), factors and NAs",
-> https://stat.ethz.ch/pipermail/r-devel/2017-January/073621.html
> In the mean time, I've spent several hours on the issue
> and also committed changes to R-devel "in two iterations".
> In the case there is a *Left* hand side part to xtabs() formula,
> see the help page example using 'esoph',
> it uses tapply(..., FUN = sum) and
> I now think there is a missing feature in tapply() there, which
> I am proposing to change.
> Look at a small example:
>> D2 <- data.frame(n = gl(3,4), L = gl(6,2, labels=LETTERS[1:6]), N=3)[-c(1,5), ]; xtabs(~., D2)
> , , N = 3
> L
> n A B C D E F
> 1 1 2 0 0 0 0
> 2 0 0 1 2 0 0
> 3 0 0 0 0 2 2
>> DN <- D2; DN[1,"N"] <- NA; DN
> n L N
> 2 1 A NA
> 3 1 B 3
> 4 1 B 3
> 6 2 C 3
> 7 2 D 3
> 8 2 D 3
> 9 3 E 3
> 10 3 E 3
> 11 3 F 3
> 12 3 F 3
>> with(DN, tapply(N, list(n,L), FUN=sum))
> A B C D E F
> 1 NA 6 NA NA NA NA
> 2 NA NA 3 6 NA NA
> 3 NA NA NA NA 6 6
>>
> and as you can see, the resulting matrix has NAs, all the same
> NA_real_, but semantically of two different kinds:
> 1) at ["1", "A"], the NA comes from the NA in 'N'
> 2) all other NAs come from the fact that there is no such factor combination
> *and* from the fact that tapply() uses
> array(dim = .., dimnames = ...)
> i.e., initializes the array with NAs (see definition of 'array').
> My proposition is the following patch to tapply(), adding a new
> option 'init.value':
> -----------------------------------------------------------------------------
> -tapply <- function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
> +tapply <- function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE)
> {
> FUN <- if (!is.null(FUN)) match.fun(FUN)
> if (!is.list(INDEX)) INDEX <- list(INDEX)
> @@ -44,7 +44,7 @@
> index <- as.logical(lengths(ans)) # equivalently, lengths(ans) > 0L
> ans <- lapply(X = ans[index], FUN = FUN, ...)
> if (simplify && all(lengths(ans) == 1L)) {
> - ansmat <- array(dim = extent, dimnames = namelist)
> + ansmat <- array(init.value, dim = extent, dimnames = namelist)
> ans <- unlist(ans, recursive = FALSE)
> } else {
> ansmat <- array(vector("list", prod(extent)),
> -----------------------------------------------------------------------------
> With that, I can set the initial value to '0' instead of array's
> default of NA :
>> with(DN, tapply(N, list(n,L), FUN=sum, init.value=0))
> A B C D E F
> 1 NA 6 0 0 0 0
> 2 0 0 3 6 0 0
> 3 0 0 0 0 6 6
>>
> which now has 0 counts and NA as is desirable to be used inside
> xtabs().
> All fine... and would not be worth a posting to R-devel,
> except for this:
> The change will not be 100% back compatible -- by necessity: any new argument for
> tapply() will make that argument name not available to be
> specified (via '...') for 'FUN'. The new function would be
>> str(tapply)
> function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE)
> where the '...' are passed FUN(), and with the new signature,
> 'init.value' then won't be passed to FUN "anymore" (compared to
> R <= 3.3.x).
> For that reason, we could use 'INIT.VALUE' instead (possibly decreasing
> the probability the arg name is used in other functions).
> Opinions?
> Thank you in advance,
> Martin
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
More information about the R-devel
mailing list