[Rd] by() processing on a dataframe

Fri Sep 30 19:22:21 CEST 2005

I want to calculate a statistic on a number of subgroups of a dataframe, 
then put the results into a dataframe.  (What SAS PROC MEANS does, I 
think, though it's been years since I used it.)

This is possible using by(), but it seems cumbersome and fragile.  Is 
there a more straightforward way than this?

Here's a simple example showing my current strategy:

 > dataset <- data.frame(gp1 = rep(1:2, c(4,4)), gp2 = rep(1:4, 
c(2,2,2,2)), value = rnorm(8))
 > dataset
   gp1 gp2      value
1   1   1  0.9493232
2   1   1 -0.0474712
3   1   2 -0.6808021
4   1   2  1.9894999
5   2   3  2.0154786
6   2   3  0.4333056
7   2   4 -0.4746228
8   2   4  0.6017522
 >
 > handleonegroup <- function(subset) data.frame(gp1 = subset$gp1[1],
+ gp2 = subset$gp2[1], statistic = mean(subset$value))
 >
 > bylist <- by(dataset, list(dataset$gp1, dataset$gp2), handleonegroup)
 >
 > result <- do.call('rbind', bylist)
 > result
    gp1 gp2  statistic
1    1   1 0.45092598
11   1   2 0.65434890
12   2   3 1.22439210
13   2   4 0.06356469

tapply() is inappropriate because I don't have all possible combinations 
of gp1 and gp2 values, only some of them:

 > tapply(dataset$value, list(dataset$gp1, dataset$gp2), mean)
          1         2        3          4
1 0.450926 0.6543489       NA         NA
2       NA        NA 1.224392 0.06356469

In the real case, I only have a very sparse subset of all the 
combinations, and tapply() and by() both die for lack of memory.

Any suggestions on how to do what I want, without using SAS?

Duncan Murdoch