[Rd] A suggestion for an amendment to tapply

Thu Nov 8 00:45:27 CET 2007

On Wed, Nov 07, 2007 at 08:15:17AM +0100, Peter Dalgaard wrote:
> Andrew Robinson wrote:
> >These are important concerns.  It seems to me that adding an argument
> >as suggested by Bill will allow the user to side-step the problem
> >identified by Brian.
> >
> >Bill, under what kinds of circumstances would you anticipate a
> >significant time penalty?  I would be happy to check those out with
> >some simulations.
> >
> >If the timing seems acceptable, I can write a patch for tapply.R and
> >tapply.Rd if anyone in the core is willing to consider them.  Please
> >contact me on or off list if so.
> >
> >  
> 
> There's another concern: tapply (et al.) has the ... args passed on to 
> FUN which means that you have to be really careful with argument names.
> 
> Could I just interject that we already have
> 
> > airquality$Month <- factor(airquality$Month,levels=4:9) # April not there
> > unlist(lapply(
> +    split(airquality$Ozone, airquality$Month, drop=F),sum, na.rm=T))
>   4    5    6    7    8    9
>   0  614  265 1537 1559  912
> 
> (splitting on multiple factors gets a bit involved, though)

For that matter, we have

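# Declare all six levels so that April (no observations) is represented
airquality$Month <- factor(airquality$Month, levels = 4:9)
air.sum <- tapply(airquality$Ozone, airquality$Month, sum, na.rm = TRUE)
# tapply() returns NA for the empty level; the sum of an empty set is 0
air.sum[is.na(air.sum)] <- 0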

which is equivalent to what I ended up using whilst fiddling with tapply.
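
The NA replacement above is safe only because sum(..., na.rm = TRUE)
never returns NA for a non-empty cell.  A minimal sketch of a more
general wrapper follows, assuming FUN returns a single value per cell.
The name tapply_fill_empty is hypothetical: it is not part of R and it is
not the patch discussed in this thread.  Empty cells are located by
counting observations rather than by is.na(), and the tryCatch() fallback
to NA is one possible answer to the case where FUN signals an error on an
empty vector.

```r
# Hypothetical helper (not in base R): tapply() with empty cells filled
# by the value of FUN on a zero-length vector.
# Assumption: FUN returns a single value per cell.
tapply_fill_empty <- function(X, INDEX, FUN, ...) {
  FUN <- match.fun(FUN)
  out <- tapply(X, INDEX, FUN, ...)

  # Value of FUN on an empty input of the same type as X.
  # If FUN signals an error on empty input, fall back to NA.
  empty_value <- tryCatch(FUN(X[0], ...), error = function(e) NA)

  # Cells with no observations: tapply() leaves them as NA when counting,
  # so a genuine NA result in a non-empty cell is not overwritten.
  counts <- tapply(X, INDEX, length)
  out[is.na(counts)] <- empty_value
  out
}

# Usage on a local copy, so the built-in data set is left untouched
aq <- airquality
aq$Month <- factor(aq$Month, levels = 4:9)   # April has no observations
res <- tapply_fill_empty(aq$Ozone, aq$Month, sum, na.rm = TRUE)
```

For the airquality data this should give 0 for April and agree with the
split()/lapply() totals quoted above for the other months.  Any warning
that FUN raises on empty input (max, for instance) is passed through
rather than suppressed.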

Andrew

> >Best wishes to all,
> >
> >Andrew
> >
> >
> >
> >
> >On Tue, Nov 06, 2007 at 07:23:56AM +0000, Prof Brian Ripley wrote:
> >  
> >>On Tue, 6 Nov 2007, Bill.Venables at csiro.au wrote:
> >>
> >>    
> >>>Unfortunately I think it would break too much existing code.  tapply()
> >>>is an old function and many people have gotten used to the way it works
> >>>now.
> >>>      
> >>It is also not necessarily desirable: FUN(numeric(0)) might be an error.
> >>For example:
> >>
> >>    
> >>>Z <- data.frame(x=rnorm(10), f=rep(c("a", "b"), each=5))[1:5, ]
> >>>tapply(Z$x, Z$f, sd)
> >>>      
> >>but sd(numeric(0)) is an error.  (Similar things involving var are 'in 
> >>the wild' and so would be broken.)
> >>
> >>    
> >>>This is not to suggest there could not be another argument added at the
> >>>end to indicate that you want the new behaviour, though.  e.g.
> >>>
> >>>tapply <- function (X, INDEX, FUN=NULL, ..., simplify=TRUE,
> >>>handle.empty.levels = FALSE)
> >>>
> >>>but this raises the question of what sort of time penalty the
> >>>modification might entail.  Probably not much for most situations, I
> >>>suppose.  (I know this argument name looks long, but you do need a
> >>>fairly specific argument name, or it will start to impinge on the ...
> >>>argument.)
> >>>
> >>>Just some thoughts.
> >>>
> >>>Bill Venables.
> >>>
> >>>Bill Venables
> >>>CSIRO Laboratories
> >>>PO Box 120, Cleveland, 4163
> >>>AUSTRALIA
> >>>Office Phone (email preferred): +61 7 3826 7251
> >>>Fax (if absolutely necessary):  +61 7 3826 7304
> >>>Mobile:                         +61 4 8819 4402
> >>>Home Phone:                     +61 7 3286 7700
> >>>mailto:Bill.Venables at csiro.au
> >>>http://www.cmis.csiro.au/bill.venables/
> >>>
> >>>-----Original Message-----
> >>>From: r-devel-bounces at r-project.org
> >>>[mailto:r-devel-bounces at r-project.org] On Behalf Of Andrew Robinson
> >>>Sent: Tuesday, 6 November 2007 3:10 PM
> >>>To: R-Devel
> >>>Subject: [Rd] A suggestion for an amendment to tapply
> >>>
> >>>Dear R-developers,
> >>>
> >>>when tapply() is invoked on factors that have empty levels, it returns
> >>>NA.  This behaviour is in accord with the tapply documentation, and is
> >>>reasonable in many cases.  However, when FUN is sum, it would also
> >>>seem reasonable to return 0 instead of NA, because "the sum of an
> >>>empty set is zero, by definition."
> >>>
> >>>I'd like to raise a discussion of the possibility of an amendment to
> >>>tapply.
> >>>
> >>>The attached patch changes the function so that it checks if there are
> >>>any empty levels, and if there are, replaces the corresponding NA
> >>>values with the result of applying FUN to the empty set.  E.g., in the
> >>>case of sum, it replaces the NA with 0, whereas with mean, it replaces
> >>>the NA with NA, and issues a warning.
> >>>
> >>>This change has the following advantage: tapply and sum work better
> >>>together.  Arguably, tapply and any other function that has a non-NA
> >>>response to the empty set will also work better together.
> >>>Furthermore, tapply shows a warning if FUN would normally show a
> >>>warning upon being evaluated on an empty set.  That deviates from
> >>>current behaviour, which might be bad, but also provides information
> >>>that might be useful to the user, so that would be good.
> >>>
> >>>The attached script provides the new function in full, and
> >>>demonstrates its application in some simple test cases.
> >>>
> >>>Best wishes,
> >>>
> >>>Andrew
> >>>
> >>>      
> >>-- 
> >>Brian D. Ripley,                  ripley at stats.ox.ac.uk
> >>Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> >>University of Oxford,             Tel:  +44 1865 272861 (self)
> >>1 South Parks Road,                     +44 1865 272866 (PA)
> >>Oxford OX1 3TG, UK                Fax:  +44 1865 272595
> >>    
> >
> >  
> 
> 
> -- 
>   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
>  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
> (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
> ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907

-- 
Andrew Robinson  
Department of Mathematics and Statistics            Tel: +61-3-8344-9763
University of Melbourne, VIC 3010 Australia         Fax: +61-3-8344-4599
http://www.ms.unimelb.edu.au/~andrewpr
http://blogs.mbs.edu/fishing-in-the-bay/