[Rd] by() processing on a dataframe
Marc Schwartz (via MN)
mschwartz at mn.rr.com
Fri Sep 30 19:48:07 CEST 2005
On Fri, 2005-09-30 at 13:22 -0400, Duncan Murdoch wrote:
> I want to calculate a statistic on a number of subgroups of a dataframe,
> then put the results into a dataframe. (What SAS PROC MEANS does, I
> think, though it's been years since I used it.)
>
> This is possible using by(), but it seems cumbersome and fragile. Is
> there a more straightforward way than this?
>
> Here's a simple example showing my current strategy:
>
> > dataset <- data.frame(gp1 = rep(1:2, c(4,4)),
> +                       gp2 = rep(1:4, c(2,2,2,2)), value = rnorm(8))
> > dataset
>   gp1 gp2      value
> 1   1   1  0.9493232
> 2   1   1 -0.0474712
> 3   1   2 -0.6808021
> 4   1   2  1.9894999
> 5   2   3  2.0154786
> 6   2   3  0.4333056
> 7   2   4 -0.4746228
> 8   2   4  0.6017522
> >
> > handleonegroup <- function(subset) data.frame(gp1 = subset$gp1[1],
> + gp2 = subset$gp2[1], statistic = mean(subset$value))
> >
> > bylist <- by(dataset, list(dataset$gp1, dataset$gp2), handleonegroup)
> >
> > result <- do.call('rbind', bylist)
> > result
>    gp1 gp2  statistic
> 1    1   1 0.45092598
> 11   1   2 0.65434890
> 12   2   3 1.22439210
> 13   2   4 0.06356469
>
> tapply() is inappropriate because I don't have all possible combinations
> of gp1 and gp2 values, only some of them:
>
> > tapply(dataset$value, list(dataset$gp1, dataset$gp2), mean)
>          1         2        3          4
> 1 0.450926 0.6543489       NA         NA
> 2       NA        NA 1.224392 0.06356469
>
>
>
> In the real case, I only have a very sparse subset of all the
> combinations, and tapply() and by() both die for lack of memory.
>
> Any suggestions on how to do what I want, without using SAS?
>
> Duncan Murdoch
Duncan,
Does this do what you want?
> set.seed(1)
> df <- data.frame(gp1 = rep(1:2, c(4,4)),
+                  gp2 = rep(1:4, c(2,2,2,2)),
+                  value = rnorm(8))
> df
  gp1 gp2      value
1   1   1 -0.6264538
2   1   1  0.1836433
3   1   2 -0.8356286
4   1   2  1.5952808
5   2   3  0.3295078
6   2   3 -0.8204684
7   2   4  0.4874291
8   2   4  0.7383247
> means <- aggregate(df$value, list(gp1 = df$gp1, gp2 = df$gp2), mean)
> means
  gp1 gp2          x
1   1   1 -0.2214052
2   1   2  0.3798261
3   2   3 -0.2454803
4   2   4  0.6128769
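
In later versions of R, aggregate() also has a formula method, which keeps the original column name (value) instead of the generic x. A minimal sketch, assuming an R version that provides that method; it is not part of the session shown here, and means2 is only an illustrative name:

```r
# Rebuild the same toy data so this block runs on its own.
set.seed(1)
df <- data.frame(gp1 = rep(1:2, c(4, 4)),
                 gp2 = rep(1:4, c(2, 2, 2, 2)),
                 value = rnorm(8))

# Formula method: mean of `value` within each (gp1, gp2) pair that
# actually occurs in the data; unobserved pairs produce no row.
means2 <- aggregate(value ~ gp1 + gp2, data = df, FUN = mean)
means2
```

The result should have one row per observed combination, with columns gp1, gp2 and value.
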
> merge(df, means, by = c("gp1", "gp2"))
  gp1 gp2      value          x
1   1   1 -0.6264538 -0.2214052
2   1   1  0.1836433 -0.2214052
3   1   2 -0.8356286  0.3798261
4   1   2  1.5952808  0.3798261
5   2   3  0.3295078 -0.2454803
6   2   3 -0.8204684 -0.2454803
7   2   4  0.4874291  0.6128769
8   2   4  0.7383247  0.6128769
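
If only the per-row group mean is needed, ave() may make the merge() step unnecessary, since it returns a vector the same length as its input. A sketch on the same toy data; the column name grp_mean is an illustrative choice, and this has not been checked against the sparse, memory-limited case described in the question:

```r
# Rebuild the same toy data so this block runs on its own.
set.seed(1)
df <- data.frame(gp1 = rep(1:2, c(4, 4)),
                 gp2 = rep(1:4, c(2, 2, 2, 2)),
                 value = rnorm(8))

# ave() applies FUN within each combination of the grouping vectors
# and places the result back in the original row positions.
df$grp_mean <- ave(df$value, df$gp1, df$gp2, FUN = mean)
df
```

Each row then carries the mean of its own (gp1, gp2) group, in the original row order.
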
HTH,
Marc Schwartz