[Rd] by() processing on a dataframe

hadley wickham h.wickham at gmail.com
Fri Sep 30 19:41:04 CEST 2005


I'm not entirely sure what you want, but maybe this does the trick?

data.frame.by <- function(data, variables, fun, ...) {
	if (length(variables) == 0 ) {
		df <- data.frame(results = 0)
		df$results <- list(fun(data$value, ...))
		return(df)
	}

	sorted <- sort.df(data, variables)[,c(variables), drop=FALSE]
	duplicates <- duplicated(sorted[,variables, drop=FALSE])
	index <- cumsum(!duplicates)

	results <- by(data, index, fun, ...)

	cols <- sorted[!duplicates,variables, drop=FALSE]
	cols$results <- array(results)
	cols
}


sort.df <- function(data, vars) {
	data[do.call("order", data[,vars, drop=FALSE]), ,drop=FALSE]
}


dataset <- data.frame(gp1 = rep(1:2, c(4,4)), gp2 = rep(1:4,
c(2,2,2,2)), value = rnorm(8))

data.frame.by(dataset, c("gp1", "gp2"), function(data) mean(data$value))
data.frame.by(dataset, "gp1", function(data) tapply(data$value, data$gp2, mean))
data.frame.by(dataset, "gp1", function(data) lm(gp2 ~ value, data)) #
doesn't print, but everything is there ok

(note that the results column will be a list if necessary - this may
be a serious abuse of data frames, but I'm not sure and no one replied
when I queried the list)

Hadley



More information about the R-devel mailing list