[R] using tapply with multiple variables

Andrew Robinson A.Robinson at ms.unimelb.edu.au
Mon May 2 01:14:40 CEST 2011


This is a nice demonstration of the formula interface to aggregate.  A
less elegant alternative is to pass lists as arguments.

with(dd, 
     aggregate(Correct, 
               by = list(Subject = Subject,
                         Group = Group), 
               FUN = function(x) sum(x == 'C')))
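
To try the call above, here is a small self-contained sketch. The toy
data frame follows Dennis's example quoted below; the seed and the names
`long` and `wide` are my own additions for illustration. It also
cross-checks the result against tapply() given a list of two factors,
which returns the same counts as a Subject-by-Group matrix.

```r
set.seed(1)  # assumption: seed added only so the run is repeatable

# Toy data in the shape of Dennis's example: 5 subjects, 2 groups
dd <- data.frame(Subject = rep(1:5, each = 100),
                 Group = rep(rep(c('C', 'T'), each = 50), 5),
                 Correct = factor(rbinom(500, 1, 0.8),
                                  levels = 0:1, labels = c('I', 'C')))

# List interface to aggregate(): one row per Subject/Group combination
long <- with(dd,
             aggregate(Correct,
                       by = list(Subject = Subject,
                                 Group = Group),
                       FUN = function(x) sum(x == 'C')))

# tapply() with a list of factors: the same counts as a matrix
wide <- with(dd,
             tapply(Correct,
                    list(Subject = Subject, Group = Group),
                    function(x) sum(x == 'C')))

long
wide
```

aggregate() returns the counts in long format, in a column named x
because the first argument is an unnamed vector; tapply() returns them
as a 5 x 2 matrix.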

Using a list is advantageous if you want to summarize more than one
variable (which does not seem to be the case here). I believe that the
formula interface doesn't allow for that.  That would be set up like
this:

with(dd, 
     aggregate(x = list(Correct = Correct
                        ## , other target variables listed here, ...
                        ), 
               by = list(Subject = Subject,
                         Group = Group), 
               FUN = function(x) sum(x == 'C')))
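
As a concrete sketch of the multi-variable call: the second response
column, Correct2, is hypothetical and not part of Kevin's data, and the
seed is only there for repeatability. For what it is worth, the help
page for aggregate() also describes a cbind(y1, y2) ~ ... form for the
formula method. Since cbind() turns factors into integer codes, the
sketch converts to 0/1 indicators (n1, n2, again my own names) before
using it.

```r
set.seed(2)  # assumption: seed added only so the run is repeatable

# Toy data with a hypothetical second response, Correct2
dd <- data.frame(Subject = rep(1:5, each = 100),
                 Group = rep(rep(c('C', 'T'), each = 50), 5),
                 Correct = factor(rbinom(500, 1, 0.8),
                                  levels = 0:1, labels = c('I', 'C')),
                 Correct2 = factor(rbinom(500, 1, 0.6),
                                   levels = 0:1, labels = c('I', 'C')))

# List interface: FUN is applied to each listed variable in turn
res <- with(dd,
            aggregate(x = list(Correct = Correct,
                               Correct2 = Correct2),
                      by = list(Subject = Subject,
                                Group = Group),
                      FUN = function(x) sum(x == 'C')))
res

# Formula interface with cbind() on 0/1 indicator columns
dd$n1 <- as.integer(dd$Correct == 'C')
dd$n2 <- as.integer(dd$Correct2 == 'C')
res2 <- aggregate(cbind(n1, n2) ~ Subject + Group, data = dd, FUN = sum)
res2
```

Both calls give one row per Subject/Group combination with one summary
column per response.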

Cheers

Andrew

On Sat, Apr 30, 2011 at 10:03:24PM -0700, Dennis Murphy wrote:
> Hi:
> 
> If you have R 2.11.x or later, you can use the formula version of aggregate():
> 
> aggregate(Correct ~ Subject + Group, data = ALLDATA,
>           FUN = function(x) sum(x == 'C'))
> 
> A variety of contributed packages (plyr, data.table, doBy, sqldf and
> remix, among others) have similar capabilities.
> 
> If you want some additional summaries (e.g., percent correct), here is
> an example function for a single subject/group that aggregate() can
> use to propagate to all subgroups and subjects (I encourage you to
> play with it):
> 
> f <- function(x) {
>     Correct <- sum(x == 'C')
>     Percent <- round(100 * Correct/length(x), 3)
>     c(Number = Correct, Percent = Percent)
>   }
> aggregate(Correct ~ Subject + Group, data = ALLDATA, FUN = f)
> 
> The particular function isn't as important as knowing you can do this
> sort of thing. Several of the contributed packages indicated above
> have similar, if not superior, capabilities, depending on the
> situation.
> 
> Toy example to test the above:
> 
> dd <- data.frame(Subject = rep(1:5, each = 100),
>                   Group = rep(rep(c('C', 'T'), each = 50), 5),
>                   Correct = factor(rbinom(500, 1, 0.8), labels = c('I', 'C')))
> aggregate(Correct ~ Subject + Group, data = dd, FUN = function(x) sum(x == 'C'))
>    Subject Group Correct
> 1        1     C      40
> 2        2     C      36
> 3        3     C      39
> 4        4     C      37
> 5        5     C      41
> 6        1     T      43
> 7        2     T      45
> 8        3     T      37
> 9        4     T      45
> 10       5     T      36
> aggregate(Correct ~ Subject + Group, data = dd, FUN = f)
>    Subject Group Correct.Number Correct.Percent
> 1        1     C             40              80
> 2        2     C             36              72
> 3        3     C             39              78
> 4        4     C             37              74
> 5        5     C             41              82
> 6        1     T             43              86
> 7        2     T             45              90
> 8        3     T             37              74
> 9        4     T             45              90
> 10       5     T             36              72
> 
> HTH,
> Dennis
> 
> On Sat, Apr 30, 2011 at 12:28 PM, Kevin Burnham <kburnham at gmail.com> wrote:
> > Hi All,
> >
> > I have a long data file generated from a minimal pair test that I gave to
> > learners of Arabic before and after a phonetic training regime.  For each of
> > thirty some subjects there are 800 rows of data, from each of 400 items at
> > pre and posttest.  For each item the subject got correct, there is a 'C' in
> > the column 'Correct'.  The line:
> >
> > tapply(ALLDATA$Correct, ALLDATA$Subject, function(x)sum(x=="C"))
> >
> > gives me the sum of correct answers for each subject.
> >
> > However, I would like to have that sum separated by Time (pre or post).  Is
> > there a simple way to do that?
> >
> >
> > What if I further wish to separate by Group (T or C)?
> >
> > Thanks,
> > Kevin
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >

-- 
Andrew Robinson  
Program Manager, ACERA 
Department of Mathematics and Statistics            Tel: +61-3-8344-6410
University of Melbourne, VIC 3010 Australia               (prefer email)
http://www.ms.unimelb.edu.au/~andrewpr              Fax: +61-3-8344-4599
http://www.acera.unimelb.edu.au/

Forest Analytics with R (Springer, 2011) 
http://www.ms.unimelb.edu.au/FAwR/
Introduction to Scientific Programming and Simulation using R (CRC, 2009): 
http://www.ms.unimelb.edu.au/spuRs/


