[R] Coercing by/tapply to data.frame for more than two indices?

Adam D. I. Kramer adik-rhelp at ilovebacon.org
Sat May 3 22:46:33 CEST 2008


Thanks very much...it is exactly what I needed, and I'm a bit embarrassed
that I couldn't find it on my own.

One might consider adding "aggregate" to the "See also:" lines of by and
tapply. That would have prevented me from needing to email the list (which I
may have accidentally done twice; I apologize for that).

--Adam

On Sat, 3 May 2008, jim holtman wrote:

> ?aggregate
>
>> aggregate(df$score, list(df$var1, df$var2, df$id), mean, na.rm=TRUE)
>    Group.1 Group.2 Group.3             x
> 1         1       1       1  0.1053576980
> 2         2       1       1  0.1514888520
> 3         3       1       1  0.1270477403
> 4         4       1       1 -0.0193129404
> 5         5       1       1  0.2574346931
> 6         1       2       1  0.0185013523
> 7         2       2       1 -0.0886420632
> 8         3       2       1 -0.1304342272
> 9         4       2       1 -0.0972963702
> 10        5       2       1 -0.1463502593
>
>
>
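A small follow-up sketch on the aggregate() call above, for anyone who would rather not get Group.1, Group.2, ... as column names. Naming the elements of the 'by' list carries those names into the result, and the formula interface gives much the same table. The 'toy' data frame and the 'mscore' column name are made up for illustration and are not the data from this thread.

```r
# Hypothetical toy data (an assumption, not the thread's df):
# 3 ids x 2 levels of var1 x 2 levels of var2 x 4 trials, fully crossed.
set.seed(1)
toy <- expand.grid(trial = 1:4, var1 = factor(1:2),
                   var2 = factor(1:2), id = factor(1:3))
toy$score <- rnorm(nrow(toy))
toy$score[c(3, 17)] <- NA   # two missing trials, in different cells

# Named list elements become the names of the grouping columns.
agg1 <- aggregate(toy$score,
                  by = list(var1 = toy$var1, var2 = toy$var2, id = toy$id),
                  FUN = mean, na.rm = TRUE)
names(agg1)[names(agg1) == "x"] <- "mscore"

# Formula interface: rows with a missing score are dropped by the
# default na.action before the means are computed.
agg2 <- aggregate(score ~ var1 + var2 + id, data = toy, FUN = mean)
```

Both calls should give one row per var1/var2/id cell. They differ when a cell has no non-missing scores at all: the first form keeps the cell with NaN, while the formula form drops it.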
> On Fri, May 2, 2008 at 6:43 PM, Adam D. I. Kramer <adik at ilovebacon.org> wrote:
>> Dear Colleagues,
>>
>>        Apologies for a long email to ask what I feel may be a very simple
>> question; I figure it's better to overspecify my situation.
>>
>>        I was asked a question, recently, by a colleague in my department
>> about pre-aggregating variables, i.e., computing the mean of defined subsets
>> of a data frame. Naturally, I thought of the 'by' and 'tapply' functions, as
>> they have always been the solution for me. However, my colleague had three
>> indices, and as such needs to pay attention to the indices of the
>> output...this is to say, the "create an array" function of tapply doesn't
>> quite work because an array is not quite what we want.
>>
>>        Consider this data set:
>>
>> df <- data.frame(var1= factor(rep(rep(1:5,25*5),10)),
>>                 var2= factor(rep(rep(1:5,each=25*5),10)),
>>                trial= rep(rep(1:25,25),10),
>>                   id= factor(rep(1:10,each=5*5*25)),
>>                score= rnorm(n=5*5*25*10) )
>>
>> ...this is to say, each of 10 ids has scores for 5 different levels of
>> var1 and 5 different levels of var2...across 25 trials. Basically, a
>> three-way crossed repeated measures design...where tapply does what I want
>> for a two-way design, it does not quite suit my purposes for a 3-way or
>> n-way for n > 2.
>>
>> The goal is to predict score from var1 and var2. The straightforward guess
>> of what to do would be to simply have the AOV function aggregate across
>> trials:
>>
>> aov(score ~ var1*var2 + Error(id/(var1*var2)), data=df)
>>
>> (or lm with defined contrasts)
>>
>> ...however, there are missing data on some trials for some people, which
>> makes this design unbalanced (i.e., it introduces a correlation between var1
>> and var2). Because my colleague knows (from a theoretical standpoint) that
>> he wants to analyze the mean, his ANOVA on the aggregated trial means WOULD
>> be balanced, which is to say, the analysis he wants to run would produce
>> different output from the above.
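To make the "ANOVA on the aggregated trial means" concrete, here is a rough sketch under some assumptions: the 'toy' data, the 'cell_means' name and the 'mscore' column are all hypothetical, and the Error() term simply mirrors the aov call quoted above. It is meant to show the shape of the analysis, not to be the definitive model.

```r
# Hypothetical toy data (an assumption, not the thread's df):
# 6 ids x 3 levels of var1 x 3 levels of var2 x 5 trials.
set.seed(1)
toy <- expand.grid(trial = 1:5, var1 = factor(1:3),
                   var2 = factor(1:3), id = factor(1:6))
toy$score <- rnorm(nrow(toy))
# One person is missing the first trial in every cell, so the
# trial-level data are unbalanced.
toy$score[toy$trial == 1 & toy$id == "2"] <- NA

# Average over trials: one mean per id x var1 x var2 cell.
cell_means <- aggregate(score ~ var1 + var2 + id, data = toy, FUN = mean)
names(cell_means)[names(cell_means) == "score"] <- "mscore"

# Check that the aggregated data are balanced (each cell exactly once).
stopifnot(all(table(cell_means$id, cell_means$var1, cell_means$var2) == 1))

# Same model as the quoted call, but fitted to the cell means.
fit <- aov(mscore ~ var1 * var2 + Error(id / (var1 * var2)),
           data = cell_means)
summary(fit)
```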
>>
>> So, what he needs is a data frame with four variables instead of five: var1,
>> var2, id, and mscore (mean score), which has been averaged across trials.
>>
>> Clearly (to me, it seems), the way to do this is with tapply:
>>
>> x <- tapply(df$score, list(df$var1,df$var2,df$id), mean, na.rm=TRUE)
>>
>> ...which returns a var1*var2 matrix for each ID, when what I want is an
>> observation-per-row data frame.
>>
>> So, my question: How do I end up with what I'm looking for?
>>
>> My current process involves setting df2 <- data.frame(mscore=c(x), ...)
>> where ... is a bunch of factor(rep) columns that would specify the var1 var2
>> and id levels. My problem with this approach is that it seems like a hack;
>> it is not a general solution because I must use knowledge of the process by
>> which x was generated in order to "get it right," and there's a decent
>> amount of room for unnoticed error on my part.
>>
>> I suppose what I'm looking for is either a way to take by or tapply and have
>> it return a set of index variable columns based on the list of indices I
>> provide to it...or a way to collapse an n-way table into a single data frame
>> with index variables. Any suggestions?
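For the "collapse an n-way table into a single data frame" part, one option in base R is to treat the tapply result as a table and convert it with as.data.frame(), which should produce one factor column per dimension plus a value column. A minimal sketch, again on made-up data ('toy', 'long' and 'mscore' are hypothetical names):

```r
# Hypothetical toy data (an assumption, not the thread's df).
set.seed(1)
toy <- expand.grid(trial = 1:4, var1 = factor(1:2),
                   var2 = factor(1:2), id = factor(1:3))
toy$score <- rnorm(nrow(toy))

# Naming the index list gives the array named dimnames.
x <- tapply(toy$score,
            list(var1 = toy$var1, var2 = toy$var2, id = toy$id),
            mean, na.rm = TRUE)

# One row per cell of the array; the dimnames become index columns.
long <- as.data.frame(as.table(x), responseName = "mscore")
head(long)
```

This works for any number of indices and uses no knowledge of how x was laid out. Cells with no data at all come out as rows with NA in mscore rather than being dropped.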
>>
>> Cordially,
>>
>> Adam D. I. Kramer
>> Ph.D. Candidate, Social Psychology
>> University of Oregon
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
>
> -- 
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem you are trying to solve?
>


