[Rd] RFC: tapply(*, ..., init.value = NA)

Martin Maechler maechler at stat.math.ethz.ch
Sat Jan 28 16:55:35 CET 2017


>>>>> Henrik Bengtsson <henrik.bengtsson at gmail.com>
>>>>>     on Fri, 27 Jan 2017 09:46:15 -0800 writes:

    > On Fri, Jan 27, 2017 at 12:34 AM, Martin Maechler
    > <maechler at stat.math.ethz.ch> wrote:
    >> 
    >> > On Jan 26, 2017 07:50, "William Dunlap via R-devel"
    >> > <r-devel at r-project.org> wrote:
    >> 
    >> > It would be cool if the default for tapply's init.value
    >> > could be FUN(X[0]), so it would be 0 for FUN=sum or
    >> > FUN=length, TRUE for FUN=all, -Inf for FUN=max, etc.
    >> > But that would take time and would break code for which
    >> > FUN did not work on length-0 objects.
    >> 
    >> > Bill Dunlap
    >> > TIBCO Software
    >> > wdunlap tibco.com
    >> 
    >> I had the same idea (after my first post), so I agree
    >> that would be nice. One could argue it would take time
    >> only if the user is too lazy to specify the value, and we
    >> could use tryCatch(FUN(X[0]), error = function(e) NA) to safeguard
    >> against those functions that fail for 0 length arg.
    >> 
    >> But I think the main reason for _not_ setting such a
    >> default is back-compatibility.  In my proposal, the new
    >> argument would not be any change by default and so all
    >> current uses of tapply() would remain unchanged.
    >> 
    >>>>>>> Henrik Bengtsson <henrik.bengtsson at gmail.com> on
    >>>>>>> Thu, 26 Jan 2017 07:57:08 -0800 writes:
    >> 
    >> > On a related note, the storage mode should try to match
    >> > ans[[1]] (or unlist:ed ans) when allocating 'ansmat' to
    >> > avoid coercion and hence a full copy.
    >> 
    >> Yes, related indeed; and would fall "in line" with Bill's
    >> idea.  OTOH, it could be implemented independently, by
    >> something like
    >> 
    >> if(missing(init.value))
    >>     init.value <- if(length(ans))
    >>         as.vector(NA, mode=storage.mode(ans[[1]])) else NA
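
To illustrate the FUN(X[0]) default discussed in the quoted exchange above, here is a minimal sketch. The helper name empty_value() is hypothetical and not part of base R or of any proposed patch; it only shows what FUN returns for a zero-length argument, with a fallback to NA when FUN fails.

```r
# Hypothetical helper (an assumption for illustration): the value FUN gives
# for an empty group, or NA if FUN signals an error on zero-length input.
empty_value <- function(FUN, X) {
    # suppressWarnings() because e.g. max() warns on empty input
    tryCatch(suppressWarnings(FUN(X[0L])), error = function(e) NA)
}

x <- c(2.5, 1, 4)
empty_value(sum, x)      # 0
empty_value(length, x)   # 0L
empty_value(all, x > 0)  # TRUE
empty_value(max, x)      # -Inf
empty_value(function(v) stop("no empty input"), x)  # NA
```

Note that the error handler has to be a function, hence error = function(e) NA.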

> I would probably do something like:

>   ans <- unlist(ans, recursive = FALSE, use.names = FALSE)
>   if (length(ans)) storage.mode(init.value) <- storage.mode(ans[[1]])
>   ansmat <- array(init.value, dim = extent, dimnames = namelist)

> instead.  That completely avoids having to use missing() and the value
> of 'init.value' will be coerced later if not done upfront.  use.names
> = FALSE speeds up unlist().

Thank you, Henrik.
That's a good idea to do the unlist() first, and with 'use.names=FALSE'.
I'll copy that.
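
As a rough sketch of why the storage mode of the fill value matters (this is not the committed tapply() code; 'extent', 'namelist' and 'filled' below are made-up stand-ins for what tapply() computes internally):

```r
# Per-cell results as a list of length-one values, then flattened.
ans <- list(a = 1.5, b = 2.5, c = 4)
ans <- unlist(ans, recursive = FALSE, use.names = FALSE)

extent   <- c(2L, 2L)                           # assumed 2 x 2 result
namelist <- list(c("r1", "r2"), c("c1", "c2"))  # assumed dimnames
filled   <- c(1L, 2L, 4L)                       # cells that have data

# Fill value already of type double: assignment needs no type change.
ansmat <- array(NA_real_, dim = extent, dimnames = namelist)
ansmat[filled] <- ans

# Logical NA fill: the array is converted to double on assignment.
ansmat_lgl <- array(NA, dim = extent, dimnames = namelist)
typeof(ansmat_lgl)         # "logical"
ansmat_lgl[filled] <- ans
typeof(ansmat_lgl)         # "double"

identical(ansmat, ansmat_lgl)  # TRUE
```

Both routes give the same result; the difference is only in whether the whole array has to be converted along the way.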

On the other hand, "brutally" modifying 'init.value' (now called 'default')
even when the user has specified it is not acceptable, I think.
You are right that it would be coerced anyway subsequently, but
the coercion will happen in whichever method of `[<-` is
appropriate.
Good S3 and S4 programmers will write such methods for their classes.

For that reason, I'm even more conservative now: I only fiddle in
the case of an atomic 'ans', and make use of the corresponding '['
method rather than as.vector(.), because that will fulfill
the following new regression test {not fulfilled in current R}:

identical(tapply(1:3, 1:3, as.raw),
	  array(as.raw(1:3), 3L, dimnames=list(1:3)))
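
One way to obtain a fill value of the right type through '[' is to index an empty slice out of bounds. The sketch below uses a hypothetical helper typed_fill(); it illustrates the idea only and is not claimed to be the code in rev 72040.

```r
# Hypothetical helper (an assumption for illustration): a "missing" value of
# the same type as an atomic 'ans', obtained via '[' rather than as.vector().
typed_fill <- function(ans) {
    if (is.atomic(ans) && length(ans)) ans[0L][1L] else NA
}

typed_fill(1:3)          # NA_integer_
typed_fill(c(1.5, 2))    # NA_real_
typed_fill(letters)      # NA_character_
typed_fill(as.raw(1:3))  # as.raw(0), since raw has no NA

# With a raw fill value the result array stays raw throughout.
res <- as.raw(1:3)
ansmat <- array(typed_fill(res), dim = 3L,
                dimnames = list(c("1", "2", "3")))
ansmat[1:3] <- res
typeof(ansmat)           # "raw"
```

Because '[' dispatches on the class of 'ans', this route also leaves room for S3 and S4 classes to supply their own notion of a missing element.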

Also, I've done a few more things -- treating if(.) . else . as a
function call, etc. -- and have now committed this as rev 72040 to
R-devel ... really wanting to get this out.

We can bet on whether there will be ripples in (visible) package space;
I give it a relatively high chance of no ripples (and a much higher
chance of problems with the more aggressive proposal ...).

Thank you again, for your "thinking along" and constructive
suggestions.

Martin


