[Rd] A suggestion for an amendment to tapply
Andrew Robinson
A.Robinson at ms.unimelb.edu.au
Thu Nov 8 00:45:27 CET 2007
On Wed, Nov 07, 2007 at 08:15:17AM +0100, Peter Dalgaard wrote:
> Andrew Robinson wrote:
> >These are important concerns. It seems to me that adding an argument
> >as suggested by Bill will allow the user to side-step the problem
> >identified by Brian.
> >
> >Bill, under what kinds of circumstances would you anticipate a
> >significant time penalty? I would be happy to check those out with
> >some simulations.
> >
> >If the timing seems acceptable, I can write a patch for tapply.R and
> >tapply.Rd if anyone in the core is willing to consider them. Please
> >contact me on or off list if so.
> >
> >
>
> There's another concern: tapply (et al.) has the ... args passed on to
> FUN which means that you have to be really careful with argument names.
>
> Could I just interject that we already have
>
> > airquality$Month <- factor(airquality$Month,levels=4:9) # April not there
> > unlist(lapply(
> + split(airquality$Ozone, airquality$Month, drop=F),sum, na.rm=T))
>    4    5    6    7    8    9
>    0  614  265 1537 1559  912
>
> (splitting on multiple factors gets a bit involved, though)
For that matter, we have
airquality$Month <- factor(airquality$Month, levels = 4:9)
air.sum <- tapply(airquality$Ozone, airquality$Month, sum, na.rm = TRUE)
air.sum[is.na(air.sum)] <- 0
which is equivalent to what I ended up using whilst fiddling with tapply.
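
The same idea can be wrapped up so that the fill-in value comes from FUN itself rather than being hard-coded as 0. What follows is only a minimal sketch, assuming a single grouping factor and a FUN that returns a length-one value; the name tapply_empty is hypothetical, and this is not the patch attached to the original post.

```r
# Hypothetical helper (not part of base R, and not the attached patch):
# behaves like tapply() for one factor, but cells belonging to empty
# levels receive FUN applied to a zero-length vector instead of NA.
tapply_empty <- function(X, INDEX, FUN, ...) {
  FUN <- match.fun(FUN)
  INDEX <- as.factor(INDEX)
  out <- tapply(X, INDEX, FUN, ...)
  empty <- as.vector(table(INDEX)) == 0   # TRUE for levels with no data
  if (any(empty)) {
    # X[0] is a zero-length vector of the same type as X;
    # any warning or error raised by FUN here reaches the caller.
    out[empty] <- FUN(X[0], ...)
  }
  out
}

aq <- airquality
aq$Month <- factor(aq$Month, levels = 4:9)   # April has no observations
res <- tapply_empty(aq$Ozone, aq$Month, sum, na.rm = TRUE)
print(res)
```

With sum, the April cell becomes 0 and the other cells are unchanged from plain tapply(). Because FUN is actually called on the empty vector, a function that warns or fails on empty input will do so here as well, which is the concern Brian raised.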
Andrew
> >Best wishes to all,
> >
> >Andrew
> >
> >
> >
> >
> >On Tue, Nov 06, 2007 at 07:23:56AM +0000, Prof Brian Ripley wrote:
> >
> >>On Tue, 6 Nov 2007, Bill.Venables at csiro.au wrote:
> >>
> >>
> >>>Unfortunately I think it would break too much existing code. tapply()
> >>>is an old function and many people have gotten used to the way it works
> >>>now.
> >>>
> >>It is also not necessarily desirable: FUN(numeric(0)) might be an error.
> >>For example:
> >>
> >>
> >>>Z <- data.frame(x=rnorm(10), f=rep(c("a", "b"), each=5))[1:5, ]
> >>>tapply(Z$x, Z$f, sd)
> >>>
> >>but sd(numeric(0)) is an error. (Similar things involving var are 'in
> >>the wild' and so would be broken.)
> >>
> >>
> >>>This is not to suggest there could not be another argument added at the
> >>>end to indicate that you want the new behaviour, though. e.g.
> >>>
> >>>tapply <- function (X, INDEX, FUN=NULL, ..., simplify=TRUE,
> >>>handle.empty.levels = FALSE)
> >>>
> >>>but this raises the question of what sort of time penalty the
> >>>modification might entail. Probably not much for most situations, I
> >>>suppose. (I know this argument name looks long, but you do need a
> >>>fairly specific argument name, or it will start to impinge on the ...
> >>>argument.)
> >>>
> >>>Just some thoughts.
> >>>
> >>>Bill Venables.
> >>>
> >>>Bill Venables
> >>>CSIRO Laboratories
> >>>PO Box 120, Cleveland, 4163
> >>>AUSTRALIA
> >>>Office Phone (email preferred): +61 7 3826 7251
> >>>Fax (if absolutely necessary): +61 7 3826 7304
> >>>Mobile: +61 4 8819 4402
> >>>Home Phone: +61 7 3286 7700
> >>>mailto:Bill.Venables at csiro.au
> >>>http://www.cmis.csiro.au/bill.venables/
> >>>
> >>>-----Original Message-----
> >>>From: r-devel-bounces at r-project.org
> >>>[mailto:r-devel-bounces at r-project.org] On Behalf Of Andrew Robinson
> >>>Sent: Tuesday, 6 November 2007 3:10 PM
> >>>To: R-Devel
> >>>Subject: [Rd] A suggestion for an amendment to tapply
> >>>
> >>>Dear R-developers,
> >>>
> >>>when tapply() is invoked on factors that have empty levels, it returns
> >>>NA. This behaviour is in accord with the tapply documentation, and is
> >>>reasonable in many cases. However, when FUN is sum, it would also
> >>>seem reasonable to return 0 instead of NA, because "the sum of an
> >>>empty set is zero, by definition."
> >>>
> >>>I'd like to raise a discussion of the possibility of an amendment to
> >>>tapply.
> >>>
> >>>The attached patch changes the function so that it checks if there are
> >>>any empty levels, and if there are, replaces the corresponding NA
> >>>values with the result of applying FUN to the empty set. Eg in the
> >>>case of sum, it replaces the NA with 0, whereas with mean, it replaces
> >>>the NA with NA, and issues a warning.
> >>>
> >>>This change has the following advantage: tapply and sum work better
> >>>together. Arguably, tapply and any other function that has a non-NA
> >>>response to the empty set will also work better together.
> >>>Furthermore, tapply shows a warning if FUN would normally show a
> >>>warning upon being evaluated on an empty set. That deviates from
> >>>current behaviour, which might be bad, but also provides information
> >>>that might be useful to the user, so that would be good.
> >>>
> >>>The attached script provides the new function in full, and
> >>>demonstrates its application in some simple test cases.
> >>>
> >>>Best wishes,
> >>>
> >>>Andrew
> >>>
> >>>
> >>--
> >>Brian D. Ripley, ripley at stats.ox.ac.uk
> >>Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
> >>University of Oxford, Tel: +44 1865 272861 (self)
> >>1 South Parks Road, +44 1865 272866 (PA)
> >>Oxford OX1 3TG, UK Fax: +44 1865 272595
> >>
> >
> >
>
>
> --
> O__ ---- Peter Dalgaard Øster Farimagsgade 5, Entr.B
> c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
> (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
> ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907
--
Andrew Robinson
Department of Mathematics and Statistics Tel: +61-3-8344-9763
University of Melbourne, VIC 3010 Australia Fax: +61-3-8344-4599
http://www.ms.unimelb.edu.au/~andrewpr
http://blogs.mbs.edu/fishing-in-the-bay/