[R] collapsing a data frame

Sat Oct 13 13:09:10 CEST 2007

The library statement should have been library(doBy).  Also note
that collapse assumes that all by variables are factors although
it would only add one statement to collapse to coerce them to
factors if we wanted that.

On 10/13/07, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> As far as I know you will have to do some work yourself but
> if we leverage some of it off the summaryBy function in doBy
> its basically just a lapply over the columns.
>
> The first arg to collapse is a formula whose rhs contains the
> by variables and whose lhs is ignored.  data is our data frame,
> fac.fun is a function to be applied to each factor column and should
> return a scalar whlie cont.fun is a function to be applied to each
> continuous column and should return a named vector.
>
> The first two lines of collapse get the by.names and remaining names
> and then we lapply over the remaining names invoking the appropriate
> function making use of summaryBy to reduce our effort on the
> continuous ones.
>
> Finally we cbind the result together and remove duplicate columns
> that the procedure produced.
>
> library(summaryBy)
> collapse <- function(fo, data,
>      fac.fun = function(x) if (length(unique(x)) == 1) x[1] else NA,
>      cont.fun = function(x) c(mean = mean(x), sd = sd(x))) {
>   by.names <- tail(all.vars(fo), 1)
>   other.names <- setdiff(names(data), by.names)
>   f <- function(nm) {
>      if (is.factor(data[[nm]])) aggregate(data[nm], h[by.names], fac.fun)
>      else summaryBy(. ~ ., data = data[c(nm, by.names)], FUN = cont.fun)
>   }
>   y <- do.call(cbind, lapply(other.names, f))
>   y[!duplicated(names(y))]
> }
> collapse(~ BROOD, h)
>
> Running it using your h gives:
>
> > collapse(~ BROOD, h)
>  BROOD INDEX TICKS.mean TICKS.sd HEIGHT.mean HEIGHT.sd YEAR LOCATION
> 1   501     1          0 0.000000         465         0   95       32
> 2   502     3          0       NA         472        NA   95       36
> 3   503     4          1 1.732051         475         0   95       37
>
>
> On 10/12/07, Ben Bolker <bolker at ufl.edu> wrote:
> >
> >   Trying to find a quick/slick/easily interpretable way to
> > collapse a data set.
> >
> >  Suppose I have a data set that looks like this:
> >
> > h <- structure(list(INDEX = structure(1:6, .Label = c("1", "2", "3",
> > "4", "5", "6"), class = "factor"), TICKS = c(0, 0, 0, 0, 0, 3
> > ), BROOD = structure(c(1L, 1L, 2L, 3L, 3L, 3L), .Label = c("501",
> > "502", "503"), class = "factor"), HEIGHT = c(465, 465, 472, 475,
> > 475, 475), YEAR = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("95",
> > "96", "97"), class = "factor"), LOCATION = structure(c(1L, 1L,
> > 2L, 3L, 3L, 3L), .Label = c("32", "36", "37"), class = "factor")), .Names =
> > c("INDEX",
> > "TICKS", "BROOD", "HEIGHT", "YEAR", "LOCATION"), row.names = c(NA,
> > 6L), class = "data.frame")
> >
> > i.e.,
> > > h
> >  INDEX TICKS BROOD HEIGHT YEAR LOCATION
> > 1     1     0   501    465   95       32
> > 2     2     0   501    465   95       32
> > 3     3     0   502    472   95       36
> > 4     4     0   503    475   95       37
> > 5     5     0   503    475   95       37
> > 6     6     3   503    475   95       37
> >
> > I want a data set that looks like this:
> >  BROOD TICKS.mean HEIGHT YEAR LOCATION
> >    501          0               465      95      32
> >    502          0               472      95      36
> >    503          1               475      95      37
> >
> > (for example).  I.e.,  I want to collapse it to a dataset by brood,
> > taking the mean of TICKS and reducing each of
> > the other variables (would be nice to allow multiple summary
> > statistics, e.g. TICKS.mean and TICKS.sd ...)
> > In some ways, this is the opposite of a database join/merge
> > operation -- I want to collapse the data frame back down.
> > If I had the "unmerged" (i.e., the brood table) handy I could
> > use it.
> >
> >  I know I can construct this table a bit at a time,
> >  using tapply() or by()  or aggregate() to get the means.
> >
> >  Here's a solution that takes the first element of each factor
> > and the mean of each numeric variable.  I can imagine there
> > are more general/flexible solutions.  (One might want to
> > specify more than one summary function, or specify that
> > factors that vary within group should be dropped.)
> >
> > vtype = sapply(h,class)  ## variable types [numeric or factor]
> > vtypes = unique(vtype)   ## possible types
> > v2 = lapply(vtypes,function(z) which(vtype==z))  ## which are which?
> > cfuns = list(factor=function(z)z[1],numeric=mean)## functions to apply
> > m = mapply(function(w,f) { aggregate(h[w],list(h$BROOD),f) },
> >  v2,cfuns,SIMPLIFY=FALSE)
> > data.frame(m[[1]],m[[2]][-1])
> >
> >  My question is whether this is re-inventing the wheel.  Is there
> > some function or package that performs this task?
> >
> >  cheers
> >    Ben Bolker
> >
> > --
> > View this message in context: http://www.nabble.com/collapsing-a-data-frame-tf4614195.html#a13177053
> > Sent from the R help mailing list archive at Nabble.com.
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>