[R] using tapply with multiple variables

Dennis Murphy djmuser at gmail.com
Sun May 1 07:03:24 CEST 2011


Hi:

If you have R 2.11.x or later, you can use the formula version of aggregate():

aggregate(Correct ~ Subject + Group, data = ALLDATA,
          FUN = function(x) sum(x == 'C'))
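
To split the counts by Time as well, as the original question asked, the extra variable can be added to the right-hand side of the formula. A minimal sketch, assuming ALLDATA has a column called Time with values 'pre' and 'post' (the column name and the small `toy` data frame below are made up for illustration):

```r
# Hypothetical stand-in for ALLDATA: 2 subjects x 2 times x 3 items
toy <- data.frame(
  Subject = rep(1:2, each = 6),
  Group   = rep(c('C', 'T'), each = 6),
  Time    = rep(rep(c('pre', 'post'), each = 3), 2),
  Correct = factor(c('C', 'I', 'I',  'C', 'C', 'I',
                     'C', 'C', 'I',  'C', 'C', 'C'),
                   levels = c('I', 'C'))
)

# One row per Subject/Group/Time combination that occurs in the data
by_time <- aggregate(Correct ~ Subject + Group + Time, data = toy,
                     FUN = function(x) sum(x == 'C'))
by_time
```

Each grouping variable on the right-hand side adds one more level of splitting, so dropping Group from the formula gives counts by Subject and Time only.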

A variety of contributed packages (plyr, data.table, doBy, sqldf and
remix, among others) have similar capabilities.

If you want some additional summaries (e.g., percent correct), here is
an example function that computes them for a single subject/group
combination; aggregate() then applies it to every combination (I
encourage you to play with it):

f <- function(x) {
    Correct <- sum(x == 'C')
    Percent <- round(100 * Correct/length(x), 3)
    c(Number = Correct, Percent = Percent)
}
aggregate(Correct ~ Subject + Group, data = ALLDATA, FUN = f)

The particular function isn't as important as knowing you can do this
sort of thing. Several of the contributed packages indicated above
have similar, if not superior, capabilities, depending on the
situation.
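
One thing to be aware of when FUN returns a vector: aggregate() stores the results in a single matrix column named after the response. It prints as Correct.Number and Correct.Percent, but these are not two separate columns. A minimal sketch of one way to flatten it, using a small made-up data frame (`toy`, `res` and `flat` are illustrative names, not part of the original data):

```r
# Hypothetical data: 2 subjects, 4 items each
toy <- data.frame(
  Subject = rep(1:2, each = 4),
  Group   = rep(c('C', 'T'), each = 4),
  Correct = factor(c('C', 'C', 'I', 'C',   'I', 'C', 'I', 'I'),
                   levels = c('I', 'C'))
)

# Same summary function as above: count and percent correct
f <- function(x) {
  Correct <- sum(x == 'C')
  Percent <- round(100 * Correct / length(x), 3)
  c(Number = Correct, Percent = Percent)
}

res <- aggregate(Correct ~ Subject + Group, data = toy, FUN = f)

ncol(res)                  # Subject, Group, and one matrix column 'Correct'
res$Correct[, 'Percent']   # reach a summary through the matrix column

# Flatten the matrix column into ordinary data frame columns
flat <- do.call(data.frame, res)
names(flat)
```

After flattening, the summaries can be used like any other column, e.g. flat$Correct.Percent.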

Toy example to test the above:

dd <- data.frame(Subject = rep(1:5, each = 100),
                  Group = rep(rep(c('C', 'T'), each = 50), 5),
                  Correct = factor(rbinom(500, 1, 0.8), labels = c('I', 'C')))
aggregate(Correct ~ Subject + Group, data = dd, FUN = function(x) sum(x == 'C'))
   Subject Group Correct
1        1     C      40
2        2     C      36
3        3     C      39
4        4     C      37
5        5     C      41
6        1     T      43
7        2     T      45
8        3     T      37
9        4     T      45
10       5     T      36
aggregate(Correct ~ Subject + Group, data = dd, FUN = f)
   Subject Group Correct.Number Correct.Percent
1        1     C             40              80
2        2     C             36              72
3        3     C             39              78
4        4     C             37              74
5        5     C             41              82
6        1     T             43              86
7        2     T             45              90
8        3     T             37              74
9        4     T             45              90
10       5     T             36              72
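
Since the original question was about tapply(), the same counts can also be obtained by passing a list of grouping variables as its second argument. A minimal sketch on the toy data, with an arbitrary seed added so the result is reproducible (the seed is an assumption, not the one behind the output above, so the numbers will differ):

```r
set.seed(123)  # arbitrary seed, chosen only for reproducibility
dd <- data.frame(Subject = rep(1:5, each = 100),
                 Group = rep(rep(c('C', 'T'), each = 50), 5),
                 Correct = factor(rbinom(500, 1, 0.8), labels = c('I', 'C')))

# tapply() with a list of factors returns a Subject x Group matrix of counts
tab <- with(dd, tapply(Correct, list(Subject = Subject, Group = Group),
                       function(x) sum(x == 'C')))
tab

# Cross-check against the aggregate() result (same cells, long format)
agg <- aggregate(Correct ~ Subject + Group, data = dd,
                 FUN = function(x) sum(x == 'C'))
all(agg$Correct == as.vector(tab))
```

Adding a third factor to the list (for instance a Time column, if the data has one) would give a three-way array instead of a matrix.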

HTH,
Dennis

On Sat, Apr 30, 2011 at 12:28 PM, Kevin Burnham <kburnham at gmail.com> wrote:
> Hi All,
>
> I have a long data file generated from a minimal pair test that I gave to
> learners of Arabic before and after a phonetic training regime.  For each of
> thirty-some subjects there are 800 rows of data, from each of 400 items at
> pre and posttest.  For each item the subject got correct, there is a 'C' in
> the column 'Correct'.  The line:
>
> tapply(ALLDATA$Correct, ALLDATA$Subject, function(x)sum(x=="C"))
>
> gives me the sum of correct answers for each subject.
>
> However, I would like to have that sum separated by Time (pre or post).  Is
> there a simple way to do that?
>
>
> What if I further wish to separate by Group (T or C)?
>
> Thanks,
> Kevin