[R] by inconsistently strips class - with fix

Alex Brown fishtank at compsoc.man.ac.uk
Tue Apr 15 13:28:15 CEST 2008


summary:

The function 'by' inconsistently strips class from the data to which  
it is applied.

quick reason:

tapply strips class when simplify is set to TRUE (the default) due to  
the class stripping behaviour of unlist.

quick answer:

This can be fixed by invoking tapply with simplify=FALSE, or changing  
tapply to use do.call(c instead of unlist

executable example:

mytimes=data.frame(date = 1:3 + Sys.time(), set = c(1,1,2))

by(mytimes$date, mytimes$set, function(x)x)

INDICES: 1
[1] "2008-04-15 11:41:38 BST" "2008-04-15 11:41:39 BST"
----------------------------------------------------------------------------------------
INDICES: 2
[1] "2008-04-15 11:41:40 BST"

by(mytimes[1,]$date, mytimes[1,]$set, function(x)x)

INDICES: 1
[1] 1208256099

why this is a problem:

This is a problem when you are feeding the output of this by into a  
function which expects the class to be maintained.  I see this problem  
when constructing

reason:

tapply strips class when simplify is set to TRUE (the default) due to  
the behaviour of unlist:

"Where possible the list elements are coerced to a common mode during  
the unlisting, and so the result often ends up as a character vector.  
Vectors will be coerced to the highest type of the components in the  
hierarchy NULL < raw < logical < integer < real < complex < character  
< list < expression: pairlists are treated as lists."

solution:

This problem can be fixed in the function by.data.frame by modifying  
the call to tapply in the function "by":

by.data.frame = function (data, INDICES, FUN, ...)
{
   if (!is.list(INDICES)) {
       IND <- vector("list", 1)
       IND[[1]] <- INDICES
       names(IND) <- deparse(substitute(INDICES))[1]
   }
   else IND <- INDICES
   FUNx <- function(x) FUN(data[x, ], ...)
   nd <- nrow(data)
<<<<
   ans <- eval(substitute(tapply(1:nd, IND, FUNx)), data)
====
   ans <- eval(substitute(tapply(1:nd, IND, FUNx, simplify=FALSE)),  
data)
 >>>>
   attr(ans, "call") <- match.call()
   class(ans) <- "by"
   ans
}

alternative solution:

the call in tapply to unlist(ans, recursive=F) can be replaced by  
do.call(c,ans, recursive=F) to fix this issue, since c does not strip  
class.

However, I haven't taken the time to work out if this will work in all  
cases.

for example:

function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
{
   FUN <- if (!is.null(FUN))
       match.fun(FUN)
   if (!is.list(INDEX))
       INDEX <- list(INDEX)
   nI <- length(INDEX)
   namelist <- vector("list", nI)
   names(namelist) <- names(INDEX)
   extent <- integer(nI)
   nx <- length(X)
   one <- 1L
   group <- rep.int(one, nx)
   ngroup <- one
   for (i in seq.int(INDEX)) {
       index <- as.factor(INDEX[[i]])
       if (length(index) != nx)
           stop("arguments must have same length")
       namelist[[i]] <- levels(index)
       extent[i] <- nlevels(index)
       group <- group + ngroup * (as.integer(index) - one)
       ngroup <- ngroup * nlevels(index)
   }
   if (is.null(FUN))
       return(group)
   ans <- lapply(split(X, group), FUN, ...)
   index <- as.integer(names(ans))
   if (simplify && all(unlist(lapply(ans, length)) == 1)) {
       ansmat <- array(dim = extent, dimnames = namelist)
<<<<
       ans <- unlist(ans, recursive = FALSE)
====
	ans <- do.call(c, ans, recursive = FALSE)
 >>>>
   }
   else {
       ansmat <- array(vector("list", prod(extent)), dim = extent,
           dimnames = namelist)
   }
   if (length(index)) {
       names(ans) <- NULL
       ansmat[index] <- ans
   }
   ansmat
}

Alexander Brown
Principal Engineer
Transitive
Maybrook House, 40 Blackfriars Street, Manchester M3 2EG
Phone: +44 (0)161 836 2321    Fax: +44 (0)161 836 2399    Mobile: +44  
(0)7980 708 221
www.transitive.com
* The leader in cross-platform virtualization



More information about the R-help mailing list