[R] ddply to count frequency of combinations

Brian Diggs diggsb at ohsu.edu
Thu Jun 23 17:44:56 CEST 2011


On 6/22/2011 11:02 PM, Idris Raja wrote:
> Brian,
>
> I'm a bit confused about how the following line works, specifically, what is
> happening in freq=length(x)? Is it just taking the length of x after it has
> been summarized by different combinations x & y? I guess that must be the
> case, because that gives the same result as using freq=length(y)
>
> d1<-ddply(d, .(x, y), summarize, freq=length(x))
> d2<-ddply(d, .(x, y), summarize, freq=length(y))

Effectively, ddply takes the dataframe (d), splits it up into multiple 
dataframes based on unique combinations of the variables (x and y), and 
calls the function (summarize) with each of the sub-dataframes in turn. 
ddply also has the option to pass additional parameters to the 
function that is called.  In this case, that is what happens with 
freq=length(x).  Each sub-dataframe is the first argument to a call to 
summarize([sub-dataframe], freq=length(x)).

summarize, in turn, takes a dataframe and other arguments in the form of 
var=value.  It evaluates each of the values in the context of the 
dataframe (that is, column names can be used directly as variables) and 
assigns the result to the variable var.  Each var then becomes a 
column of a new dataframe.

 > summarize(df, freq=length(x))
   freq
1    9
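
As a rough illustration of that evaluation step, here is a base R sketch 
(no plyr needed).  It only approximates what summarize does and is not 
its actual implementation; the name freq_df is just for this example.

```r
# Same data as in the original question.
df <- data.frame(x = c(1, 2, 3, 4, 5, 1, 2, 3, 4),
                 y = c(1, 2, 3, 4, 5, 1, 2, 4, 1))

# Evaluate length(x) with the columns of df visible as variables,
# then store the result in a one-column data frame named freq.
freq_df <- data.frame(freq = with(df, length(x)))
print(freq_df)
```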

You are right that length(y) would work just as well; since they are 
both columns in the same dataframe, they must have the same length.

(The last thing ddply does is take all the dataframes that are returned 
from the function calls and put them back together into a single 
dataframe which also includes information on which subset each 
corresponds to.)
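
The whole split / apply / combine sequence can be sketched in base R as 
follows.  This is a simplified stand-in for what ddply does, not its 
real implementation; the names pieces, results and counts are made up 
for the example, and the row order may differ from ddply's.

```r
# Same data as in the original question.
df <- data.frame(x = c(1, 2, 3, 4, 5, 1, 2, 3, 4),
                 y = c(1, 2, 3, 4, 5, 1, 2, 4, 1))

# Split: one sub-dataframe per (x, y) combination that actually occurs.
pieces <- split(df, list(df$x, df$y), drop = TRUE)

# Apply: build a one-row data frame for each piece, keeping its labels.
results <- lapply(pieces, function(piece) {
  data.frame(x = piece$x[1], y = piece$y[1], freq = nrow(piece))
})

# Combine: stack the per-piece results into a single data frame.
counts <- do.call(rbind, results)
rownames(counts) <- NULL
print(counts)
```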

> Also, what is the significance of the periods before the second argument in
> ".(x, y)" ?

The variables to split on can be given "as quoted variables, a formula 
or character vector".  The . is a function in plyr that quotes variables 
(the first option).  The following three calls are equivalent:

ddply(df, .(x, y), summarise, freq=length(x))
ddply(df, ~x+y, summarise, freq=length(x))
ddply(df, c("x", "y"), summarise, freq=length(x))
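
If you want to check this yourself, something like the following should 
do it.  It assumes plyr is installed (the block is skipped otherwise); 
the names a, b and cc are only for the example.

```r
# Compare the three ways of naming the split variables.
if (requireNamespace("plyr", quietly = TRUE)) {
  library(plyr)

  # Same data as in the original question.
  df <- data.frame(x = c(1, 2, 3, 4, 5, 1, 2, 3, 4),
                   y = c(1, 2, 3, 4, 5, 1, 2, 4, 1))

  a  <- ddply(df, .(x, y), summarise, freq = length(x))      # quoted variables
  b  <- ddply(df, ~ x + y, summarise, freq = length(x))      # formula
  cc <- ddply(df, c("x", "y"), summarise, freq = length(x))  # character vector

  # Report whether the results match.
  print(all.equal(a, b))
  print(all.equal(a, cc))
}
```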

> Thanks for the help.

You may also benefit from reading Hadley's paper on the topic:

Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data 
Analysis. Journal of Statistical Software, 40(1), 1-29. 
http://www.jstatsoft.org/v40/i01/.

> On Tue, Jun 21, 2011 at 12:54 PM, Brian Diggs <diggsb at ohsu.edu> wrote:
>
>> On 6/21/2011 11:30 AM, Idris Raja wrote:
>>
>>> I have a dataframe df with two columns x and y. I want to count the number
>>> of times a unique x, y combination occurs.
>>>
>>> For example
>>>
>>> x<- c(1,2,3,4,5,1,2,3,4)
>>> y<- c(1,2,3,4,5,1,2,4,1)
>>>
>>> df<-as.data.frame(cbind(x, y))
>>>
>>> #what is the correct way to use ddply for this example?
>>> ddply(df, c('x','y'), summarize, ??)
>>>
>>> #desired output -- format and order doesn't matter
>>> # (x, y) count
>>> #--------------------
>>> # (1, 1) 2
>>> # (2, 2) 2
>>> # (3, 3) 1
>>> # (4, 4) 1
>>> # (5, 5) 1
>>> # (3, 4) 1
>>> # (4, 1) 1
>>>
>>
>> Jorge and Dennis gave good responses that get you to the result you asked
>> for, but for completeness I thought I'd include some ddply versions:
>>
>> ddply(d, .(x, y), summarize, freq=length(x))
>>
>> This uses the summarize function you were asking about; however, you can
>> also do it with:
>>
>> ddply(d, .(x, y), nrow)
>>
>> or
>>
>> ddply(d, .(x, y), function(sub) data.frame(value=nrow(sub)))
>>
>> The latter giving a slightly nicer name (value instead of V1).
>>
>> As an aside, I prefer using the "summarise" spelling of the function when I
>> do use it, because it won't clash with Hmisc::summarize.
>>
>> ddply(d, .(x, y), summarise, freq=length(x))
>>
>>
>> --
>> Brian S. Diggs, PhD
>> Senior Research Associate, Department of Surgery
>> Oregon Health & Science University
>>

-- 
Brian S. Diggs, PhD
Senior Research Associate, Department of Surgery
Oregon Health & Science University


