[R] Subset by Factor by date
Charilaos Skiadas
cskiadas at gmail.com
Sat Jun 14 07:46:24 CEST 2008
On Jun 14, 2008, at 1:25 AM, T.D.Rudolph wrote:
>
> aggregate() is indeed a useful function in this case, but it only
> returns the
> columns by which it was grouped. Is there a way I can use this while
> simultaneously retaining all the other column values in the dataframe?
>
> e.g. add superfluous (yet pertinent for later) column containing any
> information at all and retain it in the final output
I had exactly this kind of need many times, and I have finally
created a function for it, which I hope to include soon in an
upcoming package. Here is a run of it (I added an extra "A" column
containing just the numbers 1:8):
> DF
id day diff A
1 1 01-01-09 0.5 1
2 1 01-01-09 0.7 2
3 2 01-01-09 0.2 3
4 2 01-01-09 0.4 4
5 1 01-02-09 0.1 5
6 1 01-02-09 0.3 6
7 2 01-02-09 0.3 7
8 2 01-02-09 0.4 8
> byDataFrame(DF, list(id, day), function(x) x[which.min(x$diff),])
diff A id day
1 0.5 1 1 01-01-09
2 0.2 3 2 01-01-09
3 0.1 5 1 01-02-09
4 0.3 7 2 01-02-09
Would that do what you want?
I've appended the function byDataFrame, and its prerequisite, a
function parseIndexList. I'm not quite set on the names yet, but
anyway. Hope this helps. I haven't really tested it on large sets, it
might perform poorly. Any suggestions on speeding the code /
corrections are welcome.
Haris Skiadas
Department of Mathematics and Computer Science
Hanover College
parseIndexList <- function(indexList) {
# browser()
if (!is.list(indexList))
indexList <- as.list(indexList)
nI <- length(indexList)
namelist <- vector("list", nI)
names(namelist) <- names(indexList)
extent <- integer(nI)
nx <- length(indexList[[1]])
one <- as.integer(1)
group <- rep.int(one, nx)
ngroup <- one
for (i in seq.int(indexList)) {
index <- as.factor(indexList[[i]])
if (length(index) != nx)
stop("arguments must have same length")
namelist[[i]] <- sort(unique(indexList[[i]]))
extent[i] <- length(namelist[[i]])
group <- group + ngroup * (as.integer(index) - one)
ngroup <- ngroup * nlevels(index)
}
nms <- do.call(expand.grid, namelist)
ind <- unique(sort(group))
res <- data.frame(index=ind, nms[ind, , drop=FALSE])
return(list(cases=group, groups=res))
}
byDataFrame <- function (data, INDEX, FUN, newnames,
omit.index.cols=TRUE, ...) {
# # Part of the code shamelessly stolen from tapply
IND <- eval(substitute(INDEX), data)
nms <- as.character(as.list(substitute(INDEX)))
if (!is.list(IND)) {
IND <- list(IND)
names(IND) <- nms
} else {
names(IND) <- nms[-1]
}
funname <- paste(as.character(substitute(FUN)), collapse=".")
indexInfo <- parseIndexList(IND)
FUNx <- if (omit.index.cols) {
omit.cols <- match(names(indexInfo$groups)[-1], names(data))
function(x, ...) FUN(data[x, -omit.cols], ...)
} else {
function(x, ...) FUN(data[x, ], ...)
}
ans <- lapply(split(1:nrow(data), indexInfo$cases), FUNx, ...)
index <- as.numeric(names(ans))
if (!is.data.frame(ans[[1]])) {
ans <- lapply(ans, function(x) {
dframe <- as.data.frame(t(x))
if (is.null(names(x)))
names(dframe) <- funname
dframe
})
}
lengths <- sapply(ans, nrow)
ans <- do.call(rbind, ans)
if (!missing(newnames))
names(ans) <- newnames
nms <- indexInfo$groups[rep(index, lengths),-1, drop=FALSE]
res <- cbind(ans, nms)
res
}
More information about the R-help
mailing list