[R] Coercing by/tapply to data.frame for more than two indices?

Sat May 3 00:43:00 CEST 2008

Dear Colleagues,

Apologies for a long email to ask what I feel may be a very simple
question; I figure it's better to overspecify my situation.

I was asked a question, recently, by a colleague in my department
about pre-aggregating variables, i.e., computing the mean of defined subsets
of a data frame. Naturally, I thought of the 'by' and 'tapply' functions, as
they have always been the solution for me. However, my colleague has three
indices, and so needs to keep track of which index values each output
value belongs to. That is, tapply's behavior of returning an array doesn't
quite work, because an array is not quite what we want.

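Consider this data set:

# 10 ids x 5 levels of var1 x 5 levels of var2 x 25 trials = 6250 rows
df <- data.frame(var1= factor(rep(rep(1:5,25*5),10)),
                 var2= factor(rep(rep(1:5,each=25*5),10)),
                 # each=5 so that every trial occurs once per var1/var2 cell
                 trial= rep(rep(rep(1:25,each=5),5),10),
                    id= factor(rep(1:10,each=5*5*25)),
                 score= rnorm(n=5*5*25*10) )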

That is, each of 10 ids has scores for 5 levels of var1 and 5 levels of
var2, across 25 trials: basically, a crossed repeated-measures design with
three indices (var1, var2, id). Where tapply does what I want for a two-way
design, it does not quite suit my purposes for a 3-way design, or an n-way
design for n > 2.

The goal is to predict score from var1 and var2. The straightforward thing
to do would be to simply have the aov function aggregate across trials:

aov(score ~ var1*var2 + Error(id/(var1*var2)), data=df)

(or lm with defined contrasts)

...however, there are missing data on some trials for some people, which
makes this design unbalanced (i.e., it introduces a correlation between var1
and var2). My colleague knows (from a theoretical standpoint) that he wants
to analyze the means, and his ANOVA on the trial-averaged means WOULD be
balanced; that is, the analysis he wants to run would produce different
output from the above.

So, what he needs is a data frame with four variables instead of five: var1,
var2, id, and mscore (mean score), which has been averaged across trials.

Clearly (to me, it seems), the way to do this is with tapply:

x <- tapply(df$score, list(df$var1,df$var2,df$id), mean, na.rm=TRUE)

...which returns a 5 x 5 x 10 array (a var1-by-var2 matrix for each id),
when what I want is an observation-per-row data frame.
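
To make the shape of that result concrete, here is a small sketch. The
set.seed call and the rebuilding of df (without the trial column, which is
not needed here) are additions so that the snippet runs on its own.

```r
set.seed(1)  # assumption: fixed seed, only for reproducibility
df <- data.frame(var1  = factor(rep(rep(1:5, 25*5), 10)),
                 var2  = factor(rep(rep(1:5, each=25*5), 10)),
                 id    = factor(rep(1:10, each=5*5*25)),
                 score = rnorm(n=5*5*25*10))

x <- tapply(df$score, list(df$var1, df$var2, df$id), mean, na.rm=TRUE)

dim(x)            # extents of the array: var1 levels, var2 levels, ids
x[, , "1"]        # the var1-by-var2 matrix of cell means for id 1
x["2", "3", "4"]  # a single cell mean: var1 = 2, var2 = 3, id = 4
```

The index values live only in the dimnames of x, not in columns, which is
why the result is awkward to hand to a modelling function.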

So, my question: How do I end up with what I'm looking for?

My current process involves setting df2 <- data.frame(mscore=c(x), ...)
where ... is a bunch of factor(rep) columns that would specify the var1 var2
and id levels. My problem with this approach is that it seems like a hack;
it is not a general solution because I must use knowledge of the process by
which x was generated in order to "get it right," and there's a decent
amount of room for unnoticed error on my part.
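
For concreteness, the manual approach looks roughly like the sketch below.
The exact rep() calls are a reconstruction for illustration, and the seed
and rebuilt df are there only so that the snippet is self-contained. It
relies on c(x) unrolling the array with the first index (var1) varying
fastest, then var2, then id.

```r
set.seed(1)  # assumption: fixed seed, only for reproducibility
df <- data.frame(var1  = factor(rep(rep(1:5, 25*5), 10)),
                 var2  = factor(rep(rep(1:5, each=25*5), 10)),
                 id    = factor(rep(1:10, each=5*5*25)),
                 score = rnorm(n=5*5*25*10))
x <- tapply(df$score, list(df$var1, df$var2, df$id), mean, na.rm=TRUE)

# Hand-built index columns; these must match the storage order of x exactly.
df2 <- data.frame(var1   = factor(rep(1:5, times=5*10)),
                  var2   = factor(rep(rep(1:5, each=5), times=10)),
                  id     = factor(rep(1:10, each=5*5)),
                  mscore = c(x))

# Spot check of one cell against the array it came from.
df2$mscore[df2$var1 == "2" & df2$var2 == "3" & df2$id == "4"]
x["2", "3", "4"]
```

If the order of the indices in the tapply call changes, every rep() call
has to change with it, and nothing warns you when they disagree.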

I suppose what I'm looking for is either a way to take by or tapply and have
it return a set of index variable columns based on the list of indices I
provide to it...or a way to collapse an n-way table into a single data frame
with index variables. Any suggestions?
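
One base-R route that seems to fit the second option, offered as a sketch
rather than a definitive answer: treat the array as a table and let
as.data.frame unroll it. Naming the elements of the index list (var1, var2,
id) and using "mscore" as the responseName are choices made for this
illustration; the seed and rebuilt df are only there so the snippet runs on
its own.

```r
set.seed(1)  # assumption: fixed seed, only for reproducibility
df <- data.frame(var1  = factor(rep(rep(1:5, 25*5), 10)),
                 var2  = factor(rep(rep(1:5, each=25*5), 10)),
                 id    = factor(rep(1:10, each=5*5*25)),
                 score = rnorm(n=5*5*25*10))

# Naming the index list gives the array named dimnames,
# which become the column names of the long data frame.
x <- tapply(df$score,
            list(var1=df$var1, var2=df$var2, id=df$id),
            mean, na.rm=TRUE)

# One row per var1/var2/id cell, with the cell mean in mscore.
long <- as.data.frame(as.table(x), responseName="mscore")

str(long)
head(long)
```

Because the index columns are generated from the dimnames of x itself, this
does not depend on knowing the order in which x was built.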

Cordially,

Adam D. I. Kramer
Ph.D. Candidate, Social Psychology
University of Oregon