[R] Coercing by/tapply to data.frame for more than two indices?

jim holtman jholtman at gmail.com
Sat May 3 07:20:21 CEST 2008


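See ?aggregate. It returns a data frame with one row per combination of
the grouping factors (only the first rows of the output are shown):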

> aggregate(df$score, list(df$var1, df$var2, df$id), mean, na.rm=TRUE)
    Group.1 Group.2 Group.3             x
1         1       1       1  0.1053576980
2         2       1       1  0.1514888520
3         3       1       1  0.1270477403
4         4       1       1 -0.0193129404
5         5       1       1  0.2574346931
6         1       2       1  0.0185013523
7         2       2       1 -0.0886420632
8         3       2       1 -0.1304342272
9         4       2       1 -0.0972963702
10        5       2       1 -0.1463502593
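
The Group.1/Group.2/Group.3 and x column names can be avoided by naming the
list elements. A minimal sketch, assuming a small simulated data frame in
place of the df from the question (the seed, the sizes, and the column name
mscore are illustrative choices, not taken from the original post):

```r
# Simulated stand-in for df: 3 ids x 2 x 2 conditions x 4 trials (assumed sizes)
set.seed(1)
df <- expand.grid(trial = 1:4, var1 = factor(1:2), var2 = factor(1:2),
                  id = factor(1:3))
df$score <- rnorm(nrow(df))
df$score[c(3, 10)] <- NA   # a couple of missing trials, as in the question

# Naming the list elements names the grouping columns in the result
agg <- aggregate(df$score,
                 by = list(var1 = df$var1, var2 = df$var2, id = df$id),
                 FUN = mean, na.rm = TRUE)
names(agg)[names(agg) == "x"] <- "mscore"   # rename the value column
str(agg)
```

Current versions of R also accept a formula, e.g.
aggregate(score ~ var1 + var2 + id, data = df, FUN = mean), which names the
columns directly; by default that method drops rows with NA before
aggregating.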



On Fri, May 2, 2008 at 6:43 PM, Adam D. I. Kramer <adik at ilovebacon.org> wrote:
> Dear Colleagues,
>
>        Apologies for a long email to ask what I feel may be a very simple
> question; I figure it's better to overspecify my situation.
>
>        I was asked a question, recently, by a colleague in my department
> about pre-aggregating variables, i.e., computing the mean of defined subsets
> of a data frame. Naturally, I thought of the 'by' and 'tapply' functions, as
> they have always been the solution for me. However, my colleague had three
> indices, and as such needs to pay attention to the indices of the
> output...this is to say, the "create an array" function of tapply doesn't
> quite work because an array is not quite what we want.
>
>        Consider this data set:
>
> df <- data.frame(var1= factor(rep(rep(1:5,25*5),10)),
>                 var2= factor(rep(rep(1:5,each=25*5),10)),
>                trial= rep(rep(1:25,25),10),
>                   id= factor(rep(1:10,each=5*5*25)),
>                score= rnorm(n=5*5*25*10) )
>
> ...this is to say, each of 10 ids has scores for 5 different levels of
> var1 and 5 different levels of var2...across 25 trials. Basically, a
> three-way crossed repeated measures design...where tapply does what I want
> for a two-way design, it does not quite suit my purposes for a 3-way or
> n-way for n > 2.
>
> The goal is to predict score from var1 and var2. The straightforward guess
> of what to do would be to simply have the AOV function aggregate across
> trials:
>
> aov(score ~ var1*var2 + Error(id/(var1*var2)), data=df)
>
> (or lm with defined contrasts)
>
> ...however, there are missing data on some trials for some people, which
> makes this design unbalanced (i.e., it introduces a correlation between var1
> and var2). Because my colleague knows (from a theoretical standpoint) that
> he wants to analyze the mean, his ANOVA on the aggregated trial means WOULD
> be balanced, which is to say, the analysis he wants to run would produce
> different output from the above.
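
Once the trial means are in a one-row-per-cell data frame, the same model
formula can be fit to them. A rough sketch, assuming a small simulated data
set and the hypothetical names agg and mscore (neither is from the original
post):

```r
# Simulated stand-in for df: 3 ids x 2 x 2 conditions x 4 trials (assumed sizes)
set.seed(1)
df <- expand.grid(trial = 1:4, var1 = factor(1:2), var2 = factor(1:2),
                  id = factor(1:3))
df$score <- rnorm(nrow(df))
df$score[c(3, 10)] <- NA   # missing trials make the raw data unbalanced

# Average across trials: one mean per var1 x var2 x id cell
agg <- aggregate(df$score,
                 by = list(var1 = df$var1, var2 = df$var2, id = df$id),
                 FUN = mean, na.rm = TRUE)
names(agg)[names(agg) == "x"] <- "mscore"

# Every id now contributes exactly one value per cell, so the design is balanced
fit <- aov(mscore ~ var1 * var2 + Error(id / (var1 * var2)), data = agg)
summary(fit)
```

This only restores balance if every id has at least one non-missing trial in
every var1 x var2 cell; a cell with no data at all would still be missing
after aggregation.
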
>
> So, what he needs is a data frame with four variables instead of five: var1,
> var2, id, and mscore (mean score), which has been averaged across trials.
>
> Clearly (to me, it seems), the way to do this is with tapply:
>
> x <- tapply(df$score, list(df$var1,df$var2,df$id), mean, na.rm=TRUE)
>
> ...which returns a var1*var2 matrix for each ID, when what I want is an
> observation-per-row data frame.
>
> So, my question: How do I end up with what I'm looking for?
>
> My current process involves setting df2 <- data.frame(mscore=c(x), ...)
> where ... is a bunch of factor(rep) columns that would specify the var1 var2
> and id levels. My problem with this approach is that it seems like a hack;
> it is not a general solution because I must use knowledge of the process by
> which x was generated in order to "get it right," and there's a decent
> amount of room for unnoticed error on my part.
>
> I suppose what I'm looking for is either a way to take by or tapply and have
> it return a set of index variable columns based on the list of indices I
> provide to it...or a way to collapse an n-way table into a single data frame
> with index variables. Any suggestions?
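
For collapsing an n-way array from tapply directly, one general route is
as.data.frame(as.table(x)), which builds the index columns from the array's
dimnames. A minimal sketch, assuming a small simulated data set (the sizes,
seed, and the names df2 and mscore are illustrative, not from the original
post):

```r
# Simulated stand-in for df: 3 ids x 2 x 2 conditions x 4 trials (assumed sizes)
set.seed(1)
df <- expand.grid(trial = 1:4, var1 = factor(1:2), var2 = factor(1:2),
                  id = factor(1:3))
df$score <- rnorm(nrow(df))

# Naming the index list gives the array named dimnames
x <- tapply(df$score, list(var1 = df$var1, var2 = df$var2, id = df$id),
            mean, na.rm = TRUE)

# One row per cell; the dimnames become factor columns
df2 <- as.data.frame(as.table(x), responseName = "mscore")
head(df2)
```

Unlike aggregate, this keeps cells that have no observations; they appear as
rows with mscore equal to NA.
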
>
> Cordially,
>
> Adam D. I. Kramer
> Ph.D. Candidate, Social Psychology
> University of Oregon
>
>



-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?


