[R] Subset by Factor by date
T.D.Rudolph
prairie.picker at gmail.com
Sat Jun 14 08:59:11 CEST 2008
I can't speak to the intricacies of the formula but when I run the
ByDataFrame() function provided on a subsample of my data (n=50) it returned
only the very first id value in the output; the rest came out as <NA>....
This is not to say it has not properly selected the rows with min(x$diff),
but I have no way of verifying without the id membership in the output.
Charilaos Skiadas-3 wrote:
>
>
> On Jun 14, 2008, at 1:25 AM, T.D.Rudolph wrote:
>
>>
>> aggregate() is indeed a useful function in this case, but it only
>> returns the
>> columns by which it was grouped. Is there a way I can use this while
>> simultaneously retaining all the other column values in the dataframe?
>>
>> e.g. add superfluous (yet pertinent for later) column containing any
>> information at all and retain it in the final output
>
> I had exactly this kind of need many times, and I have finally
> created a function for it, which I hope to include soon in an
> upcoming package. Here is a run of it (I added an extra "A" column
> containing just the numbers 1:8):
>
> > DF
> id day diff A
> 1 1 01-01-09 0.5 1
> 2 1 01-01-09 0.7 2
> 3 2 01-01-09 0.2 3
> 4 2 01-01-09 0.4 4
> 5 1 01-02-09 0.1 5
> 6 1 01-02-09 0.3 6
> 7 2 01-02-09 0.3 7
> 8 2 01-02-09 0.4 8
> > byDataFrame(DF, list(id, day), function(x) x[which.min(x$diff),])
> diff A id day
> 1 0.5 1 1 01-01-09
> 2 0.2 3 2 01-01-09
> 3 0.1 5 1 01-02-09
> 4 0.3 7 2 01-02-09
>
> Would that do what you want?
>
> I've appended the function byDataFrame, and its prerequisite, a
> function parseIndexList. I'm not quite set on the names yet, but
> anyway. Hope this helps. I haven't really tested it on large sets, it
> might perform poorly. Any suggestions on speeding the code /
> corrections are welcome.
>
> Haris Skiadas
> Department of Mathematics and Computer Science
> Hanover College
>
>
>
> parseIndexList <- function(indexList) {
> # browser()
> if (!is.list(indexList))
> indexList <- as.list(indexList)
> nI <- length(indexList)
> namelist <- vector("list", nI)
> names(namelist) <- names(indexList)
> extent <- integer(nI)
> nx <- length(indexList[[1]])
> one <- as.integer(1)
> group <- rep.int(one, nx)
> ngroup <- one
> for (i in seq.int(indexList)) {
> index <- as.factor(indexList[[i]])
> if (length(index) != nx)
> stop("arguments must have same length")
> namelist[[i]] <- sort(unique(indexList[[i]]))
> extent[i] <- length(namelist[[i]])
> group <- group + ngroup * (as.integer(index) - one)
> ngroup <- ngroup * nlevels(index)
> }
> nms <- do.call(expand.grid, namelist)
> ind <- unique(sort(group))
> res <- data.frame(index=ind, nms[ind, , drop=FALSE])
> return(list(cases=group, groups=res))
> }
>
> byDataFrame <- function (data, INDEX, FUN, newnames,
> omit.index.cols=TRUE, ...) {
> # # Part of the code shamelessly stolen from tapply
> IND <- eval(substitute(INDEX), data)
> nms <- as.character(as.list(substitute(INDEX)))
> if (!is.list(IND)) {
> IND <- list(IND)
> names(IND) <- nms
> } else {
> names(IND) <- nms[-1]
> }
> funname <- paste(as.character(substitute(FUN)), collapse=".")
> indexInfo <- parseIndexList(IND)
> FUNx <- if (omit.index.cols) {
> omit.cols <- match(names(indexInfo$groups)[-1], names(data))
> function(x, ...) FUN(data[x, -omit.cols], ...)
> } else {
> function(x, ...) FUN(data[x, ], ...)
> }
> ans <- lapply(split(1:nrow(data), indexInfo$cases), FUNx, ...)
> index <- as.numeric(names(ans))
> if (!is.data.frame(ans[[1]])) {
> ans <- lapply(ans, function(x) {
> dframe <- as.data.frame(t(x))
> if (is.null(names(x)))
> names(dframe) <- funname
> dframe
> })
> }
> lengths <- sapply(ans, nrow)
> ans <- do.call(rbind, ans)
> if (!missing(newnames))
> names(ans) <- newnames
> nms <- indexInfo$groups[rep(index, lengths),-1, drop=FALSE]
> res <- cbind(ans, nms)
> res
> }
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
>
--
View this message in context: http://www.nabble.com/Subset-by-Factor-by-date-tp17835631p17836560.html
Sent from the R help mailing list archive at Nabble.com.
More information about the R-help
mailing list