[R] Aggregating multiple columns

Thu Mar 19 22:41:55 CET 2009

Dear colleagues,

 	Consider the following data frame:

x <- data.frame(y=rnorm(100),order=rep(1:10,10),subject=rep(1:10,each=10))

 	My goal is to aggregate x to compute a linear effect of order
for each subject. Ideally, result would be a vector containing a single
number for each subject: the slope of the regression of y on order.

 	I first tried this:

result <- aggregate(x[,1:2],list(subject=x$subject),
             function (z) { lm(y ~ order, data=z)$coefficients[2] }
           )

...because lm(y ~ order, data=x, subset=x$subject==1)$coefficients[2] would
give me the correct term for subject 1 (i.e., that is the number I am
actually looking for).

 	However, when used on data frames, aggregate() aggregates every
COLUMN in x _separately_ using FUN...while lm needs both columns *together.*
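
 	To illustrate the column-by-column behaviour, here is a minimal
sketch. The seed and the name "seen" are additions for illustration only;
the data frame has the same layout as x above. FUN simply reports the class
and length of whatever it is handed, which shows that it receives one plain
vector per column per subject and never a two-column data frame that lm
could use:

```r
# Same layout as x above; the seed is only there to make the run repeatable.
set.seed(1)
x <- data.frame(y = rnorm(100), order = rep(1:10, 10), subject = rep(1:10, each = 10))

# FUN is called once per column within each subject group.
# Here it records the class and length of the object it receives.
seen <- aggregate(x[, c("y", "order")], list(subject = x$subject),
                  function(z) paste(class(z), length(z)))

# One row per subject; each cell describes a single atomic vector of 10 values.
print(head(seen, 3))
```

Since z is an atomic vector in every call, lm(y ~ order, data=z) has no
columns named y and order to work with.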

 	...I then turned to tapply, but that is useful only on "atomic
objects," and not data frames.

 	I have two solutions, which I find inelegant and slow:

1) result <- sapply(levels(factor(x$subject)),
                function(z) { lm(y ~ order, data=x, subset=subject==z)$coefficients[2]}
              )

...this gets the job done, but is very slow.

2) result <- c();
s <- levels(factor(x$subject));
for (z in seq_along(s)) { result[z] <- lm(y ~ order, data=x,
subset=x$subject==s[z])$coefficients[2] };

...also does the job, but is also very slow.

Is there a better solution? I miss the speed of tapply and aggregate; the
example has only 100 rows and 10 subjects, but the actual data has many more
of each.

Cordially,
Adam D. I. Kramer
Ph.D. Candidate, Social and Personality Psychology
University of Oregon