[R] Subset by Factor by date

Sat Jun 14 07:46:24 CEST 2008

On Jun 14, 2008, at 1:25 AM, T.D.Rudolph wrote:

>
> aggregate() is indeed a useful function in this case, but it only  
> returns the
> columns by which it was grouped.  Is there a way I can use this while
> simultaneously retaining all the other column values in the dataframe?
>
> e.g. add superfluous (yet pertinent for later) column containing any
> information at all and retain it in the final output

I had exactly this kind of need many times, and I have finally  
created a function for it, which I hope to include soon in an  
upcoming package. Here is a run of it (I added an extra "A" column  
containing just the numbers 1:8):

 > DF
   id      day diff A
1  1 01-01-09  0.5 1
2  1 01-01-09  0.7 2
3  2 01-01-09  0.2 3
4  2 01-01-09  0.4 4
5  1 01-02-09  0.1 5
6  1 01-02-09  0.3 6
7  2 01-02-09  0.3 7
8  2 01-02-09  0.4 8
 > byDataFrame(DF, list(id, day), function(x) x[which.min(x$diff),])
   diff A id      day
1  0.5 1  1 01-01-09
2  0.2 3  2 01-01-09
3  0.1 5  1 01-02-09
4  0.3 7  2 01-02-09

Would that do what you want?

I've appended the function byDataFrame, and its prerequisite, a  
function parseIndexList. I'm not quite set on the names yet, but  
anyway. Hope this helps. I haven't really tested it on large sets, it  
might perform poorly. Any suggestions on speeding the code /  
corrections are welcome.

Haris Skiadas
Department of Mathematics and Computer Science
Hanover College

parseIndexList <- function(indexList) {
   # browser()
   if (!is.list(indexList))
     indexList <- as.list(indexList)
   nI <- length(indexList)
   namelist <- vector("list", nI)
   names(namelist) <- names(indexList)
   extent <- integer(nI)
   nx <- length(indexList[[1]])
   one <- as.integer(1)
   group <- rep.int(one, nx)
   ngroup <- one
   for (i in seq.int(indexList)) {
       index <- as.factor(indexList[[i]])
       if (length(index) != nx)
           stop("arguments must have same length")
       namelist[[i]] <- sort(unique(indexList[[i]]))
       extent[i] <- length(namelist[[i]])
       group <- group + ngroup * (as.integer(index) - one)
       ngroup <- ngroup * nlevels(index)
   }
   nms <- do.call(expand.grid, namelist)
   ind <- unique(sort(group))
   res <- data.frame(index=ind, nms[ind, , drop=FALSE])
   return(list(cases=group, groups=res))
}

byDataFrame <- function (data, INDEX, FUN, newnames,  
omit.index.cols=TRUE, ...) {
# # Part of the code shamelessly stolen from tapply
   IND <- eval(substitute(INDEX), data)
   nms <- as.character(as.list(substitute(INDEX)))
   if (!is.list(IND)) {
     IND <- list(IND)
     names(IND) <- nms
   } else {
     names(IND) <- nms[-1]
   }
   funname <- paste(as.character(substitute(FUN)), collapse=".")
   indexInfo <- parseIndexList(IND)
   FUNx <- if (omit.index.cols) {
     omit.cols <- match(names(indexInfo$groups)[-1], names(data))
     function(x, ...) FUN(data[x, -omit.cols], ...)
   } else {
     function(x, ...) FUN(data[x, ], ...)
   }
   ans <- lapply(split(1:nrow(data), indexInfo$cases), FUNx, ...)
   index <- as.numeric(names(ans))
   if (!is.data.frame(ans[[1]])) {
     ans <- lapply(ans, function(x) {
       dframe <- as.data.frame(t(x))
       if (is.null(names(x)))
         names(dframe) <- funname
       dframe
     })
   }
   lengths <- sapply(ans, nrow)
   ans <- do.call(rbind, ans)
   if (!missing(newnames))
     names(ans) <- newnames
   nms <- indexInfo$groups[rep(index, lengths),-1, drop=FALSE]
   res <- cbind(ans, nms)
   res
}