[Rd] A suggestion for an amendment to tapply

Peter Dalgaard p.dalgaard at biostat.ku.dk
Wed Nov 7 08:15:17 CET 2007

Andrew Robinson wrote:
> These are important concerns.  It seems to me that adding an argument
> as suggested by Bill will allow the user to side-step the problem
> identified by Brian.
> Bill, under what kinds of circumstances would you anticipate a
> significant time penalty?  I would be happy to check those out with
> some simulations.
> If the timing seems acceptable, I can write a patch for tapply.R and
> tapply.Rd if anyone in the core is willing to consider them.  Please
> contact me on or off list if so.

There's another concern: tapply (et al.) passes its ... arguments on to
FUN, which means that you have to be really careful with argument names.
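
To make the hazard concrete, here is a minimal sketch. The wrapper
my_tapply and its extra na.rm argument are hypothetical, invented only
for illustration; they are not part of tapply. An argument that the
wrapper matches by name is consumed there and never reaches FUN.

```r
# Hypothetical wrapper (not the real tapply): it adds its own `na.rm` argument.
my_tapply <- function(X, INDEX, FUN, ..., na.rm = FALSE) {
  # `na.rm` is matched by the wrapper itself, so it is absent from `...`
  # and FUN never sees it.
  sapply(split(X, INDEX), FUN, ...)
}

x <- c(1, NA, 3, 4)
f <- factor(c("a", "a", "b", "b"))

swallowed <- my_tapply(x, f, sum, na.rm = TRUE)  # sum() runs without na.rm
forwarded <- tapply(x, f, sum, na.rm = TRUE)     # sum() receives na.rm = TRUE
```

Here swallowed has NA for level "a", while forwarded has 1, which is
why any new tapply argument would need a name that no FUN is likely to
use.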

Could I just interject that we already have

 > airquality$Month <- factor(airquality$Month,levels=4:9) # April not there
 > unlist(lapply(
+    split(airquality$Ozone, airquality$Month, drop=FALSE),sum, na.rm=TRUE))
   4    5    6    7    8    9
   0  614  265 1537 1559  912

(splitting on multiple factors gets a bit involved, though)
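
For the multiple-factor case, one possible sketch follows. The second
factor Hot (Temp above 80) and the working copy aq are assumptions made
up for this example. split() on a list of factors groups by their
interaction, with the first factor varying fastest, so the results can
be folded back into a matrix column by column.

```r
aq <- airquality
aq$Month <- factor(aq$Month, levels = 4:9)           # April has no rows
# Hypothetical second factor, introduced only for this illustration.
aq$Hot <- factor(aq$Temp > 80, levels = c(FALSE, TRUE),
                 labels = c("cool", "hot"))

# One group per Month x Hot combination; drop = FALSE keeps the empty ones.
groups <- split(aq$Ozone, list(aq$Month, aq$Hot), drop = FALSE)
sums <- vapply(groups, sum, numeric(1), na.rm = TRUE)

# The first factor varies fastest, so fill the matrix column-wise by Month.
tab <- matrix(sums, nrow = nlevels(aq$Month),
              dimnames = list(Month = levels(aq$Month),
                              Hot = levels(aq$Hot)))
```

The row for month 4 then holds zeros rather than NA, and the row sums
agree with the single-factor result above.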

> Best wishes to all,
> Andrew
> On Tue, Nov 06, 2007 at 07:23:56AM +0000, Prof Brian Ripley wrote:
>> On Tue, 6 Nov 2007, Bill.Venables at csiro.au wrote:
>>> Unfortunately I think it would break too much existing code.  tapply()
>>> is an old function and many people have gotten used to the way it works
>>> now.
>> It is also not necessarily desirable: FUN(numeric(0)) might be an error.
>> For example:
>> > Z <- data.frame(x=rnorm(10), f=rep(c("a", "b"), each=5))[1:5, ]
>> > tapply(Z$x, Z$f, sd)
>> but sd(numeric(0)) is an error.  (Similar things involving var are 'in the 
>> wild' and so would be broken.)
>>> This is not to suggest there could not be another argument added at the
>>> end to indicate that you want the new behaviour, though.  e.g.
>>> tapply <- function (X, INDEX, FUN=NULL, ..., simplify=TRUE,
>>> handle.empty.levels = FALSE)
>>> but this raises the question of what sort of time penalty the
>>> modification might entail.  Probably not much for most situations, I
>>> suppose.  (I know this argument name looks long, but you do need a
>>> fairly specific argument name, or it will start to impinge on the ...
>>> argument.)
>>> Just some thoughts.
>>> Bill Venables.
>>> Bill Venables
>>> CSIRO Laboratories
>>> PO Box 120, Cleveland, 4163
>>> Office Phone (email preferred): +61 7 3826 7251
>>> Fax (if absolutely necessary):  +61 7 3826 7304
>>> Mobile:                         +61 4 8819 4402
>>> Home Phone:                     +61 7 3286 7700
>>> mailto:Bill.Venables at csiro.au
>>> http://www.cmis.csiro.au/bill.venables/
>>> -----Original Message-----
>>> From: r-devel-bounces at r-project.org
>>> [mailto:r-devel-bounces at r-project.org] On Behalf Of Andrew Robinson
>>> Sent: Tuesday, 6 November 2007 3:10 PM
>>> To: R-Devel
>>> Subject: [Rd] A suggestion for an amendment to tapply
>>> Dear R-developers,
>>> when tapply() is invoked on factors that have empty levels, it returns
>>> NA.  This behaviour is in accord with the tapply documentation, and is
>>> reasonable in many cases.  However, when FUN is sum, it would also
>>> seem reasonable to return 0 instead of NA, because "the sum of an
>>> empty set is zero, by definition."
>>> I'd like to raise a discussion of the possibility of an amendment to
>>> tapply.
>>> The attached patch changes the function so that it checks if there are
>>> any empty levels, and if there are, replaces the corresponding NA
>>> values with the result of applying FUN to the empty set.  Eg in the
>>> case of sum, it replaces the NA with 0, whereas with mean, it replaces
>>> the NA with NA, and issues a warning.
>>> This change has the following advantage: tapply and sum work better
>>> together.  Arguably, tapply and any other function that has a non-NA
>>> response to the empty set will also work better together.
>>> Furthermore, tapply shows a warning if FUN would normally show a
>>> warning upon being evaluated on an empty set.  That deviates from
>>> current behaviour, which might be bad, but also provides information
>>> that might be useful to the user, so that would be good.
>>> The attached script provides the new function in full, and
>>> demonstrates its application in some simple test cases.
>>> Best wishes,
>>> Andrew
>> -- 
>> Brian D. Ripley,                  ripley at stats.ox.ac.uk
>> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
>> University of Oxford,             Tel:  +44 1865 272861 (self)
>> 1 South Parks Road,                     +44 1865 272866 (PA)
>> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
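
As a rough sketch of the opt-in behaviour discussed in the quoted
messages (not the attached patch, which is not reproduced here): the
wrapper name tapply_empty is hypothetical, the argument name
handle.empty.levels is taken from the quoted suggestion, and the sketch
assumes FUN returns a single value per cell.

```r
# Hypothetical wrapper around tapply; not the patch under discussion.
tapply_empty <- function(X, INDEX, FUN, ..., handle.empty.levels = FALSE) {
  out <- tapply(X, INDEX, FUN, ...)
  if (handle.empty.levels) {
    # Cell counts have the same layout as `out`; empty cells count zero.
    empty <- as.vector(table(INDEX) == 0)
    # Apply FUN to a zero-length vector only when asked to, so that an
    # error or warning from FUN on empty input surfaces only on request.
    if (any(empty)) out[empty] <- FUN(X[0], ...)
  }
  out
}

month <- factor(airquality$Month, levels = 4:9)
old_way <- tapply_empty(airquality$Ozone, month, sum, na.rm = TRUE)
new_way <- tapply_empty(airquality$Ozone, month, sum, na.rm = TRUE,
                        handle.empty.levels = TRUE)
```

With the flag left at FALSE the empty April cell stays NA; with TRUE it
becomes sum(numeric(0)), that is 0.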

   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907
