[R] by inconsistently strips class - with fix

Prof Brian Ripley ripley at stats.ox.ac.uk
Thu Apr 17 08:03:33 CEST 2008


Unfortunately your proposed change alters the type of the output: 
simplification is intended in many applications of by().

Before:

> str(by(mytimes$date[1], mytimes$set[1], function(x)x))
  by [, 1] 1.21e+09
  - attr(*, "dimnames")=List of 1
   ..$ mytimes$set[1]: chr "1"
  - attr(*, "call")= language by.default(data = mytimes$date[1], INDICES = 
mytimes$set[1],      FUN = function(x) x)

After:

> str(by(mytimes$date[1], mytimes$set[1], function(x)x))
List of 1
  $ 1: POSIXct[1:1], format: "2008-04-17 06:53:31"
  - attr(*, "dim")= int 1
  - attr(*, "dimnames")=List of 1
   ..$ mytimes$set[1]: chr "1"
  - attr(*, "call")= language by.default(data = mytimes$date[1], INDICES = 
mytimes$set[1],      FUN = function(x) x)
  - attr(*, "class")= chr "by"

c() does not do the same thing as unlist() in general, and it is untrue 
that 'c does not strip class'.  What happens in your example is that there 
is a c() method for your class (and not many others).
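
A small sketch of the difference, assuming a recent R session. The class 
name "myclass" is made up for illustration; unlike POSIXct, it has no c() 
method.

```r
# A date-time value: POSIXct has its own c() method (c.POSIXct).
t1 <- as.POSIXct("2008-04-17 06:53:31", tz = "UTC")
times <- list(t1, t1 + 1)

# unlist() drops the class and returns the underlying numbers.
class(unlist(times))

# do.call(c, ...) dispatches to c.POSIXct, so the class survives.
class(do.call(c, times))

# A hypothetical S3 class with no c() method: c() strips the class too.
x <- structure(1, class = "myclass")
class(do.call(c, list(x, x)))
```

So replacing unlist() with do.call(c, ...) would preserve class only for 
those classes that happen to supply a c() method.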

What we could do is add a 'simplify' argument to by() so you can control 
the simplification.
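
As a sketch of how such an argument would be used: later releases of R 
appear to provide by(..., simplify = TRUE), passed on to tapply(). The 
example below assumes that signature, and the name 'one_row' is only 
illustrative.

```r
# Same shape of data as the example in the thread.
mytimes <- data.frame(
  date = as.POSIXct("2008-04-15 11:41:38", tz = "UTC") + 1:3,
  set  = c(1, 1, 2)
)
one_row <- mytimes[1, ]

# simplify = FALSE is handed to tapply(), so each cell keeps whatever
# object FUN returned, class included.
res <- by(one_row, one_row$set, function(d) d$date, simplify = FALSE)
class(res[[1]])
```

With the default simplify = TRUE, a single-value result like this one can 
still come back as a bare number, as in the "Before" output above.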


On Tue, 15 Apr 2008, Alex Brown wrote:

> summary:
>
> The function 'by' inconsistently strips class from the data to which
> it is applied.
>
> quick reason:
>
> tapply strips class when simplify is set to TRUE (the default) due to
> the class stripping behaviour of unlist.
>
> quick answer:
>
> This can be fixed by invoking tapply with simplify=FALSE, or by changing
> tapply to use do.call(c, ...) instead of unlist
>
> executable example:
>
> mytimes=data.frame(date = 1:3 + Sys.time(), set = c(1,1,2))
>
> by(mytimes$date, mytimes$set, function(x)x)
>
> INDICES: 1
> [1] "2008-04-15 11:41:38 BST" "2008-04-15 11:41:39 BST"
> ----------------------------------------------------------------------------------------
> INDICES: 2
> [1] "2008-04-15 11:41:40 BST"
>
> by(mytimes[1,]$date, mytimes[1,]$set, function(x)x)
>
> INDICES: 1
> [1] 1208256099
>
> why this is a problem:
>
> This is a problem when you are feeding the output of this by into a
> function which expects the class to be maintained.  I see this problem
> when constructing
>
> reason:
>
> tapply strips class when simplify is set to TRUE (the default) due to
> the behaviour of unlist:
>
> "Where possible the list elements are coerced to a common mode during
> the unlisting, and so the result often ends up as a character vector.
> Vectors will be coerced to the highest type of the components in the
> hierarchy NULL < raw < logical < integer < real < complex < character
> < list < expression: pairlists are treated as lists."
>
> solution:
>
> This problem can be fixed in the function by.data.frame by modifying
> the call to tapply in the function "by":
>
> by.data.frame = function (data, INDICES, FUN, ...)
> {
>   if (!is.list(INDICES)) {
>       IND <- vector("list", 1)
>       IND[[1]] <- INDICES
>       names(IND) <- deparse(substitute(INDICES))[1]
>   }
>   else IND <- INDICES
>   FUNx <- function(x) FUN(data[x, ], ...)
>   nd <- nrow(data)
> <<<<
>   ans <- eval(substitute(tapply(1:nd, IND, FUNx)), data)
> ====
>   ans <- eval(substitute(tapply(1:nd, IND, FUNx, simplify=FALSE)),
> data)
> >>>>
>   attr(ans, "call") <- match.call()
>   class(ans) <- "by"
>   ans
> }
>
> alternative solution:
>
> the call in tapply to unlist(ans, recursive = FALSE) can be replaced by
> do.call(c, ans) to fix this issue, since c does not strip
> class.
>
> However, I haven't taken the time to work out if this will work in all
> cases.
>
> for example:
>
> function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
> {
>   FUN <- if (!is.null(FUN))
>       match.fun(FUN)
>   if (!is.list(INDEX))
>       INDEX <- list(INDEX)
>   nI <- length(INDEX)
>   namelist <- vector("list", nI)
>   names(namelist) <- names(INDEX)
>   extent <- integer(nI)
>   nx <- length(X)
>   one <- 1L
>   group <- rep.int(one, nx)
>   ngroup <- one
>   for (i in seq.int(INDEX)) {
>       index <- as.factor(INDEX[[i]])
>       if (length(index) != nx)
>           stop("arguments must have same length")
>       namelist[[i]] <- levels(index)
>       extent[i] <- nlevels(index)
>       group <- group + ngroup * (as.integer(index) - one)
>       ngroup <- ngroup * nlevels(index)
>   }
>   if (is.null(FUN))
>       return(group)
>   ans <- lapply(split(X, group), FUN, ...)
>   index <- as.integer(names(ans))
>   if (simplify && all(unlist(lapply(ans, length)) == 1)) {
>       ansmat <- array(dim = extent, dimnames = namelist)
> <<<<
>       ans <- unlist(ans, recursive = FALSE)
> ====
>       ans <- do.call(c, ans)
> >>>>
>   }
>   else {
>       ansmat <- array(vector("list", prod(extent)), dim = extent,
>           dimnames = namelist)
>   }
>   if (length(index)) {
>       names(ans) <- NULL
>       ansmat[index] <- ans
>   }
>   ansmat
> }
>
> Alexander Brown

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


