[R] Data aggregation question

William Dunlap wdunlap at tibco.com
Fri Jul 29 00:12:08 CEST 2011


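Have you tried using table()?  It counts every combination of the
factor levels, including combinations that never occur, and
as.data.frame() turns the result into one row per combination.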

E.g.,
> df <- data.frame(x=c("A","A","B","C"), y=c("ii","ii","i","ii"), Age=2^(1:4))
> tab <- do.call("table", df[c("x","y")])
> tab
   y
x   i ii
  A 0  2
  B 1  0
  C 0  1
> as.data.frame(tab)
  x  y Freq
1 A  i    0
2 B  i    1
3 C  i    0
4 A ii    2
5 B ii    0
6 C ii    1
> str(.Last.value)
'data.frame':   6 obs. of  3 variables:
 $ x   : Factor w/ 3 levels "A","B","C": 1 2 3 1 2 3
 $ y   : Factor w/ 2 levels "i","ii": 1 1 1 2 2 2
 $ Freq: int  0 1 0 2 0 1
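
The same approach should carry over to the three-way case in the original
question.  A minimal sketch, assuming hypothetical factors A, B, and C (the
data frame df3 and its level names are made up for illustration and are not
from the original dataset):

```r
# Hypothetical data: three factors with explicitly declared levels, so that
# every level (and every combination of levels) appears in the result.
df3 <- data.frame(
  A = factor(c("a1", "a1", "a2", "a3"), levels = c("a1", "a2", "a3")),
  B = factor(c("b1", "b1", "b2", "b1"), levels = c("b1", "b2")),
  C = factor(c("c1", "c2", "c1", "c1"), levels = c("c1", "c2"))
)

# Cross all three factors; table() keeps cells with a count of zero.
counts <- as.data.frame(table(df3[c("A", "B", "C")]))

# One row per combination: 3 * 2 * 2 = 12 rows, most with Freq == 0.
nrow(counts)

# The combinations that never occurred.
subset(counts, Freq == 0)
```

Declaring the levels with factor(..., levels=) matters only when a level
never appears in the data at all; levels that occur at least once are
picked up automatically.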

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com 

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of David Warren
> Sent: Thursday, July 28, 2011 1:25 PM
> To: r-help at r-project.org
> Subject: [R] Data aggregation question
> 
> Hi all,
> 
>      I'm working with a sizable dataset that I'd like to summarize, but I
> can't find a tool or function that will do quite what I'd like.  Basically,
> I'd like to summarize the data by fully crossing three variables and getting
> a count of the number of observations for every level of that 3-way
> interaction.  For example, if factors A, B, and C each have 3 levels (all of
> which were observed someplace in the dataset), I'd like to know how many
> times A1, B1, and C1 co-occurred in the dataset.  Functions like aggregate
> and summaryBy do a decent job when I sum a vector of ones of the same length
> as the original dataset, but I'm getting stuck on the fact that neither will
> return 0-count combinations of the three variables in question.  I
> understand that this is a desirable outcome (if A1, B1, C2 didn't occur, it
> shouldn't be counted and isn't), but I need to know both when these
> combinations of factors did and did not occur.  I'm stuck on this one, and
> would really appreciate any help.  Thanks in advance!
> 
> Dave Warren
> 
> PS A functional solution would be best; the original dataset contains about
> 2.3 million observations, so any looping is going to be very slow.
> 
> --
> Post-doctoral Fellow
> Neurology Department
> University of Iowa Hospitals and Clinics
> davideugenewarren at gmail.com


