[Rd] RFC: tapply(*, ..., init.value = NA)
Martin Maechler
maechler at stat.math.ethz.ch
Fri Jan 27 09:34:09 CET 2017
> On Jan 26, 2017 07:50, "William Dunlap via R-devel" <r-devel at r-project.org>
> wrote:
> It would be cool if the default for tapply's init.value could be
> FUN(X[0]), so it would be 0 for FUN=sum or FUN=length, TRUE for
> FUN=all, -Inf for FUN=max, etc. But that would take time and would
> break code for which FUN did not work on length-0 objects.
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
I had the same idea (after my first post), so I agree that would
be nice. One could argue it would take time only if the user is too lazy
to specify the value, and we could use
tryCatch(FUN(X[0]), error = NA)
to safeguard against those functions that fail for 0 length arg.
But I think the main reason for _not_ setting such a default is
back-compatibility. In my proposal, the new argument would not
be any change by default and so all current uses of tapply()
would remain unchanged.
>>>>> Henrik Bengtsson <henrik.bengtsson at gmail.com>
>>>>> on Thu, 26 Jan 2017 07:57:08 -0800 writes:
> On a related note, the storage mode should try to match ans[[1]] (or
> unlist:ed and) when allocating 'ansmat' to avoid coercion and hence a full
> copy.
Yes, related indeed; and would fall "in line" with Bill's idea.
OTOH, it could be implemented independently,
by something like
if(missing(init.value))
init.value <-
if(length(ans)) as.vector(NA, mode=storage.mode(ans[[1]]))
else NA
.............
A colleague proposed to use the shorter argument name 'default'
instead of 'init.value' which indeed maybe more natural and
still not too often used as "non-first" argument in FUN(.).
Thank you for the constructive feedback!
Martin
> On Thu, Jan 26, 2017 at 2:42 AM, Martin Maechler
> <maechler at stat.math.ethz.ch> wrote:
>> Last week, we've talked here about "xtabs(), factors and NAs",
-> https://stat.ethz.ch/pipermail/r-devel/2017-January/073621.html
>>
>> In the mean time, I've spent several hours on the issue
>> and also committed changes to R-devel "in two iterations".
>>
>> In the case there is a *Left* hand side part to xtabs() formula,
>> see the help page example using 'esoph',
>> it uses tapply(..., FUN = sum) and
>> I now think there is a missing feature in tapply() there, which
>> I am proposing to change.
>>
>> Look at a small example:
>>
>>> D2 <- data.frame(n = gl(3,4), L = gl(6,2, labels=LETTERS[1:6]),
> N=3)[-c(1,5), ]; xtabs(~., D2)
>> , , N = 3
>>
>> L
>> n A B C D E F
>> 1 1 2 0 0 0 0
>> 2 0 0 1 2 0 0
>> 3 0 0 0 0 2 2
>>
>>> DN <- D2; DN[1,"N"] <- NA; DN
>> n L N
>> 2 1 A NA
>> 3 1 B 3
>> 4 1 B 3
>> 6 2 C 3
>> 7 2 D 3
>> 8 2 D 3
>> 9 3 E 3
>> 10 3 E 3
>> 11 3 F 3
>> 12 3 F 3
>>> with(DN, tapply(N, list(n,L), FUN=sum))
>> A B C D E F
>> 1 NA 6 NA NA NA NA
>> 2 NA NA 3 6 NA NA
>> 3 NA NA NA NA 6 6
>>>
>>
>> and as you can see, the resulting matrix has NAs, all the same
>> NA_real_, but semantically of two different kinds:
>>
>> 1) at ["1", "A"], the NA comes from the NA in 'N'
>> 2) all other NAs come from the fact that there is no such factor
> combination
>> *and* from the fact that tapply() uses
>>
>> array(dim = .., dimnames = ...)
>>
>> i.e., initializes the array with NAs (see definition of 'array').
>>
>> My proposition is the following patch to tapply(), adding a new
>> option 'init.value':
>>
>> ------------------------------------------------------------
> -----------------
>>
>> -tapply <- function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
>> +tapply <- function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify
> = TRUE)
>> {
>> FUN <- if (!is.null(FUN)) match.fun(FUN)
>> if (!is.list(INDEX)) INDEX <- list(INDEX)
>> @@ -44,7 +44,7 @@
>> index <- as.logical(lengths(ans)) # equivalently, lengths(ans) > 0L
>> ans <- lapply(X = ans[index], FUN = FUN, ...)
>> if (simplify && all(lengths(ans) == 1L)) {
>> - ansmat <- array(dim = extent, dimnames = namelist)
>> + ansmat <- array(init.value, dim = extent, dimnames = namelist)
>> ans <- unlist(ans, recursive = FALSE)
>> } else {
>> ansmat <- array(vector("list", prod(extent)),
>>
>> ------------------------------------------------------------
> -----------------
>>
>> With that, I can set the initial value to '0' instead of array's
>> default of NA :
>>
>>> with(DN, tapply(N, list(n,L), FUN=sum, init.value=0))
>> A B C D E F
>> 1 NA 6 0 0 0 0
>> 2 0 0 3 6 0 0
>> 3 0 0 0 0 6 6
>>>
>>
>> which now has 0 counts and NA as is desirable to be used inside
>> xtabs().
>>
>> All fine... and would not be worth a posting to R-devel,
>> except for this:
>>
>> The change will not be 100% back compatible -- by necessity: any new
> argument for
>> tapply() will make that argument name not available to be
>> specified (via '...') for 'FUN'. The new function would be
>>
>>> str(tapply)
>> function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE)
>>
>> where the '...' are passed FUN(), and with the new signature,
>> 'init.value' then won't be passed to FUN "anymore" (compared to
>> R <= 3.3.x).
>>
>> For that reason, we could use 'INIT.VALUE' instead (possibly decreasing
>> the probability the arg name is used in other functions).
>>
>>
>> Opinions?
>>
>> Thank you in advance,
>> Martin
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
> [[alternative HTML version deleted]]
More information about the R-devel
mailing list