[Rd] by() processing on a dataframe
Duncan Murdoch
murdoch at stats.uwo.ca
Fri Sep 30 19:22:21 CEST 2005
I want to calculate a statistic on a number of subgroups of a dataframe,
then put the results into a dataframe. (What SAS PROC MEANS does, I
think, though it's been years since I used it.)
This is possible using by(), but it seems cumbersome and fragile. Is
there a more straightforward way than this?
Here's a simple example showing my current strategy:
> dataset <- data.frame(gp1 = rep(1:2, c(4,4)), gp2 = rep(1:4,
+ c(2,2,2,2)), value = rnorm(8))
> dataset
  gp1 gp2      value
1   1   1  0.9493232
2   1   1 -0.0474712
3   1   2 -0.6808021
4   1   2  1.9894999
5   2   3  2.0154786
6   2   3  0.4333056
7   2   4 -0.4746228
8   2   4  0.6017522
>
> handleonegroup <- function(subset) data.frame(gp1 = subset$gp1[1],
+ gp2 = subset$gp2[1], statistic = mean(subset$value))
>
> bylist <- by(dataset, list(dataset$gp1, dataset$gp2), handleonegroup)
>
> result <- do.call('rbind', bylist)
> result
   gp1 gp2  statistic
1    1   1 0.45092598
11   1   2 0.65434890
12   2   3 1.22439210
13   2   4 0.06356469
tapply() is inappropriate because I don't have all possible combinations
of gp1 and gp2 values, only some of them:
> tapply(dataset$value, list(dataset$gp1, dataset$gp2), mean)
         1         2        3          4
1 0.450926 0.6543489       NA         NA
2       NA        NA 1.224392 0.06356469
In the real case, I only have a very sparse subset of all the
combinations, and tapply() and by() both die for lack of memory.
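To make the size problem concrete, here is a minimal sketch on made-up data (not the real dataset). It assumes 1000 levels in each grouping variable with only 1000 of the million combinations occupied; the names sparse, grid, key, first and stat are hypothetical. The second half shows one direction that might avoid the full grid, namely collapsing both keys into a single string so that only observed combinations become levels. It is a sketch under those assumptions, not a tested solution for the real case.

```r
# Hypothetical sparse data: 1000 x 1000 possible (gp1, gp2) cells,
# of which only 1000 are occupied (two rows each).
set.seed(1)
sparse <- data.frame(gp1 = rep(1:1000, each = 2),
                     gp2 = rep(1:1000, each = 2),
                     value = rnorm(2000))

# tapply() on two grouping vectors allocates the full grid, mostly NA.
grid <- tapply(sparse$value, list(sparse$gp1, sparse$gp2), mean)
dim(grid)           # size of the allocated grid
sum(!is.na(grid))   # number of cells that actually hold data

# Collapsing both keys into one string means only the observed
# combinations become levels, so the result has one entry per group.
key   <- paste(sparse$gp1, sparse$gp2, sep = ":")
first <- !duplicated(key)           # first row of each observed group
stat  <- tapply(sparse$value, key, mean)

# Look the statistics up by name so they line up with the key columns.
result <- data.frame(gp1 = sparse$gp1[first],
                     gp2 = sparse$gp2[first],
                     statistic = as.vector(stat[key[first]]))
```

The combined-key version still goes through tapply(), but over a single factor whose levels are the occupied cells only, so its memory use grows with the number of groups present rather than with the product of the level counts.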
Any suggestions on how to do what I want, without using SAS?
Duncan Murdoch