[Rd] by() processing on a dataframe
Duncan Murdoch
murdoch at stats.uwo.ca
Fri Sep 30 20:35:38 CEST 2005
On 9/30/2005 1:41 PM, Peter Dalgaard wrote:
> Duncan Murdoch <murdoch at stats.uwo.ca> writes:
>
>> I want to calculate a statistic on a number of subgroups of a dataframe,
>> then put the results into a dataframe. (What SAS PROC MEANS does, I
>> think, though it's been years since I used it.)
>>
>> This is possible using by(), but it seems cumbersome and fragile. Is
>> there a more straightforward way than this?
>>
>> Here's a simple example showing my current strategy:
>>
>> > dataset <- data.frame(gp1 = rep(1:2, c(4,4)), gp2 = rep(1:4,
>> c(2,2,2,2)), value = rnorm(8))
>> > dataset
>> gp1 gp2 value
>> 1 1 1 0.9493232
>> 2 1 1 -0.0474712
>> 3 1 2 -0.6808021
>> 4 1 2 1.9894999
>> 5 2 3 2.0154786
>> 6 2 3 0.4333056
>> 7 2 4 -0.4746228
>> 8 2 4 0.6017522
>> >
>> > handleonegroup <- function(subset) data.frame(gp1 = subset$gp1[1],
>> + gp2 = subset$gp2[1], statistic = mean(subset$value))
>> >
>> > bylist <- by(dataset, list(dataset$gp1, dataset$gp2), handleonegroup)
>> >
>> > result <- do.call('rbind', bylist)
>> > result
>> gp1 gp2 statistic
>> 1 1 1 0.45092598
>> 11 1 2 0.65434890
>> 12 2 3 1.22439210
>> 13 2 4 0.06356469
>>
>> tapply() is inappropriate because I don't have all possible combinations
>> of gp1 and gp2 values, only some of them:
>>
>> > tapply(dataset$value, list(dataset$gp1, dataset$gp2), mean)
>> 1 2 3 4
>> 1 0.450926 0.6543489 NA NA
>> 2 NA NA 1.224392 0.06356469
>>
>>
>>
>> In the real case, I only have a very sparse subset of all the
>> combinations, and tapply() and by() both die for lack of memory.
>>
>> Any suggestions on how to do what I want, without using SAS?
>
> Have you tried aggregate()?
aggregate() has a few problems:

 - It applies the function to every column in the dataframe. In my
   case it only makes sense to apply it to some of them. (This may not be
   a killer, but it certainly makes things inefficient and tricky.)
 - I'd like to look at the whole subset to figure out the function (but
   I can probably work around this).
 - It uses too much memory. E.g. try

    > df <- data.frame(x=rnorm(1000), y=rnorm(1000), z=rnorm(1000),
    +                  w=rnorm(1000))
    > aggregate(df, list(df$x,df$y,df$z), mean)
    Error: cannot allocate vector of size 3906250 Kb
    In addition: Warning messages:
    1: Reached total allocation of 1007Mb: see help(memory.size)
    2: Reached total allocation of 1007Mb: see help(memory.size)
This should have returned the same dataframe (there are 1000 non-empty
subsets, one per row), but it tried to construct a billion of them, one
for every combination of the x, y and z values.
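
As a rough check, the size in that error message matches a full 1000 x 1000 x 1000 table. The sketch below assumes 4 bytes per cell (an assumption for illustration; the thread does not say what the vector holds) and the variable names are made up:

```r
# Back-of-the-envelope size of the full cross product of three grouping
# variables with 1000 distinct values each.
n_levels <- 1000          # distinct values of x, y and z in the example
cells <- n_levels^3       # one cell per (x, y, z) combination
bytes_per_cell <- 4       # ASSUMPTION: 4-byte entries, e.g. integers
size_kb <- cells * bytes_per_cell / 1024
size_kb                   # 3906250, the figure in the error message
```

Only 1000 of those cells correspond to rows that actually exist in df.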
On 9/30/2005 1:48 PM, Don MacQueen wrote:
> Look at the summarize() function in the Hmisc package.
It seems to want a matrix, not a data.frame. The real situation has
mixed types (character, factors, numeric) so it can't be a matrix.
> (and this is an r-help question, not an r-devel question, I would
> think)
Yes, that's where I should have posted. Sorry. However, this is
starting to look like a development problem...
Peter again:
> Alternatively, you might split on interaction(...., drop=TRUE)
Looking at the code, it appears that will construct the full product
interaction, then subset to the non-empty cases... Yes, it does that.
Looks like I'll have to write my own.
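
A hand-rolled version could look roughly like the sketch below. sparse_by() is a made-up name, not a function from base R or any package. The sketch assumes the grouping columns convert cleanly to strings (factors, integers, characters) and that "\r" never occurs in them. It numbers only the key combinations that actually occur, so the work grows with the number of observed groups rather than with the product of the level counts.

```r
# HYPOTHETICAL helper: apply `fun` to the rows of each observed combination
# of the grouping columns, without building the full cross product.
sparse_by <- function(data, groups, fun) {
  # One string key per row. ASSUMPTION: "\r" does not occur in the keys.
  key <- do.call(paste, c(unname(as.list(data[groups])), sep = "\r"))
  # Integer group id per row, numbered by first appearance.
  id <- match(key, unique(key))
  # Row indices for each observed group; no empty groups are created.
  pieces <- split(seq_len(nrow(data)), id)
  results <- lapply(pieces, function(rows) fun(data[rows, , drop = FALSE]))
  out <- do.call(rbind, results)
  rownames(out) <- NULL
  out
}

# Usage with the example from earlier in the thread.
dataset <- data.frame(gp1 = rep(1:2, c(4, 4)),
                      gp2 = rep(1:4, c(2, 2, 2, 2)),
                      value = rnorm(8))
handleonegroup <- function(subset) data.frame(gp1 = subset$gp1[1],
    gp2 = subset$gp2[1], statistic = mean(subset$value))
result <- sparse_by(dataset, c("gp1", "gp2"), handleonegroup)
```

Pasting keys into strings is simple but not the only option. It still makes one pass over all rows and one subset copy per group, so it is a starting point, not a tuned implementation.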
Duncan