[Rd] by() processing on a dataframe

Fri Sep 30 20:35:38 CEST 2005

On 9/30/2005 1:41 PM, Peter Dalgaard wrote:
> Duncan Murdoch <murdoch at stats.uwo.ca> writes:
> 
>> I want to calculate a statistic on a number of subgroups of a dataframe, 
>> then put the results into a dataframe.  (What SAS PROC MEANS does, I 
>> think, though it's been years since I used it.)
>> 
>> This is possible using by(), but it seems cumbersome and fragile.  Is 
>> there a more straightforward way than this?
>> 
>> Here's a simple example showing my current strategy:
>> 
>>  > dataset <- data.frame(gp1 = rep(1:2, c(4,4)), gp2 = rep(1:4, 
>> c(2,2,2,2)), value = rnorm(8))
>>  > dataset
>>    gp1 gp2      value
>> 1   1   1  0.9493232
>> 2   1   1 -0.0474712
>> 3   1   2 -0.6808021
>> 4   1   2  1.9894999
>> 5   2   3  2.0154786
>> 6   2   3  0.4333056
>> 7   2   4 -0.4746228
>> 8   2   4  0.6017522
>>  >
>>  > handleonegroup <- function(subset) data.frame(gp1 = subset$gp1[1],
>> + gp2 = subset$gp2[1], statistic = mean(subset$value))
>>  >
>>  > bylist <- by(dataset, list(dataset$gp1, dataset$gp2), handleonegroup)
>>  >
>>  > result <- do.call('rbind', bylist)
>>  > result
>>     gp1 gp2  statistic
>> 1    1   1 0.45092598
>> 11   1   2 0.65434890
>> 12   2   3 1.22439210
>> 13   2   4 0.06356469
>> 
>> tapply() is inappropriate because I don't have all possible combinations 
>> of gp1 and gp2 values, only some of them:
>> 
>>  > tapply(dataset$value, list(dataset$gp1, dataset$gp2), mean)
>>           1         2        3          4
>> 1 0.450926 0.6543489       NA         NA
>> 2       NA        NA 1.224392 0.06356469
>> 
>> 
>> 
>> In the real case, I only have a very sparse subset of all the 
>> combinations, and tapply() and by() both die for lack of memory.
>> 
>> Any suggestions on how to do what I want, without using SAS?
> 
> Have you tried aggregate()?

aggregate() has a few problems:

  - it applies the function to every column in the dataframe.  In my 
case it only makes sense to apply it to some of them.  (This may not be 
a killer, but it certainly makes things inefficient and tricky.)
  - I'd like to look at the whole subset to figure out the function (but 
I can probably work around this)
  - It uses too much memory.  E.g. try

 > df <- data.frame(x=rnorm(1000), y=rnorm(1000), z=rnorm(1000), 
w=rnorm(1000))
 > aggregate(df, list(df$x,df$y,df$z), mean)
Error: cannot allocate vector of size 3906250 Kb
In addition: Warning messages:
1: Reached total allocation of 1007Mb: see help(memory.size)
2: Reached total allocation of 1007Mb: see help(memory.size)

This should have returned the same dataframe (there are 1000 subsets), 
but it tried to construct a billion of them.

On 9/30/2005 1:48 PM, Don MacQueen wrote:
 > Look at the summarize() function in the Hmisc package.

It seems to want a matrix, not a data.frame.  The real situation has 
mixed types (character, factors, numeric) so it can't be a matrix.

 > (and I this is an r-help question, not an r-devel question, I would 
think)

Yes, that's where I should have posted.  Sorry.  However, this is 
starting to look like a development problem...

Peter again:

> Alternatively, you migth split on interaction(...., drop=TRUE)

Looking at the code, it appears that will construct the full product 
interaction, then subset to the non-empty cases... Yes, it does that.

Looks like I'll have to write my own.

Duncan