[R] ddply to count frequency of combinations
Brian Diggs
diggsb at ohsu.edu
Thu Jun 23 17:44:56 CEST 2011
On 6/22/2011 11:02 PM, Idris Raja wrote:
> Brian,
>
> I'm a bit confused about how the following line works, specifically, what is
> happening in freq=length(x)? Is it just taking the length of x after it has
> been summarized by different combinations x & y? I guess that must be the
> case, because that gives the same result as using freq=length(y)
>
> d1 <- ddply(df, .(x, y), summarize, freq=length(x))
> d2 <- ddply(df, .(x, y), summarize, freq=length(y))
Effectively, ddply takes the dataframe (d), splits it up into multiple
dataframes based on unique combinations of the variables (x and y), and
calls the function (summarize) with each of the sub-dataframes in turn.
ddply also has the option to pass additional parameters to the
function that is called. In this case, that is what happens with
freq=length(x). Each sub-dataframe is the first argument to a call to
summarize([sub-dataframe], freq=length(x)).
summarize, in turn, takes a dataframe and other arguments in the form of
var=value. It evaluates each of the values in the context of the
dataframe (that is, column names can be used directly as variables) and
assigns the result to the variable var. These vars then become the
columns of a new dataframe.
  > summarize(df, freq=length(x))
    freq
  1    9
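To make the "evaluated in the context of the dataframe" part concrete, here is a rough base-R sketch. It is not how plyr implements summarize; the helper name my_summarize is made up for illustration, and it handles only a single expression with a fixed output column called freq.

```r
# Rebuild the example data from the original question.
df <- data.frame(x = c(1, 2, 3, 4, 5, 1, 2, 3, 4),
                 y = c(1, 2, 3, 4, 5, 1, 2, 4, 1))

# Hypothetical helper (not part of plyr): evaluate one unevaluated
# expression with the columns of `data` visible as variables, and return
# the result as a one-column data frame named `freq`.
my_summarize <- function(data, expr) {
  data.frame(freq = eval(expr, envir = data, enclos = parent.frame()))
}

# There is no global variable called x here, so length(x) can only be
# resolved by looking x up as a column of df.
res <- my_summarize(df, quote(length(x)))
print(res)
```

The point of the sketch is only that x is looked up inside the dataframe that was passed in, which is why the same expression gives a different answer for each sub-dataframe.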
You are right that length(y) would work just as well; since they are
both columns in the same dataframe, they must have the same length.
(The last thing ddply does is take all the dataframes that are returned
from the function calls and put them back together into a single
dataframe which also includes information on which subset each
corresponds to.)
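The three steps (split, apply, combine) can also be spelled out by hand in base R. This is only a sketch of the idea under the assumption that the data are the x and y vectors from the original question; ddply does considerably more bookkeeping than this.

```r
# Rebuild the example data from the original question.
df <- data.frame(x = c(1, 2, 3, 4, 5, 1, 2, 3, 4),
                 y = c(1, 2, 3, 4, 5, 1, 2, 4, 1))

# Split: one sub-dataframe per (x, y) combination that actually occurs.
# drop = TRUE discards combinations with no rows.
pieces <- split(df, list(df$x, df$y), drop = TRUE)

# Apply: summarise each piece, keeping the labels of the group it came from.
counts <- lapply(pieces, function(piece) {
  data.frame(x = piece$x[1], y = piece$y[1], freq = nrow(piece))
})

# Combine: stack the per-group results into a single data frame.
result <- do.call(rbind, counts)
rownames(result) <- NULL
print(result)
```

The row order here follows the factor levels that split() builds, so it need not match the order ddply returns.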
> Also, what is the significance of the periods before the second argument in
> ".(x, y)" ?
The variables to split on can be given "as quoted variables, a formula
or character vector". The . is a function in plyr that quotes variables
(the first option). The following three calls are equivalent:
  ddply(df, .(x, y), summarise, freq=length(x))
  ddply(df, ~x+y, summarise, freq=length(x))
  ddply(df, c("x", "y"), summarise, freq=length(x))
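If you want to check this yourself, something along these lines should do it. It assumes the plyr package is installed and uses the example data from the original question; the names vars, by_quoted, by_formula and by_names are just illustrative.

```r
library(plyr)  # assumes the plyr package is installed

# Rebuild the example data from the original question.
df <- data.frame(x = c(1, 2, 3, 4, 5, 1, 2, 3, 4),
                 y = c(1, 2, 3, 4, 5, 1, 2, 4, 1))

# .() captures its arguments unevaluated rather than looking up x and y.
vars <- .(x, y)
print(class(vars))

# The same grouping written in the three accepted forms.
by_quoted  <- ddply(df, .(x, y), summarise, freq = length(x))
by_formula <- ddply(df, ~ x + y, summarise, freq = length(x))
by_names   <- ddply(df, c("x", "y"), summarise, freq = length(x))

# Compare the results pairwise.
print(all.equal(by_quoted, by_formula))
print(all.equal(by_quoted, by_names))
```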
> Thanks for the help.
You may also benefit from reading Hadley's paper on the topic:
Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data
Analysis. Journal of Statistical Software, 40(1), 1-29.
http://www.jstatsoft.org/v40/i01/.
> On Tue, Jun 21, 2011 at 12:54 PM, Brian Diggs <diggsb at ohsu.edu> wrote:
>
>> On 6/21/2011 11:30 AM, Idris Raja wrote:
>>
>>> I have a dataframe df with two columns x and y. I want to count the number
>>> of times a unique x, y combination occurs.
>>>
>>> For example
>>>
>>> x <- c(1,2,3,4,5,1,2,3,4)
>>> y <- c(1,2,3,4,5,1,2,4,1)
>>>
>>> df <- as.data.frame(cbind(x, y))
>>>
>>> #what is the correct way to use ddply for this example?
>>> ddply(df, c('x','y', summarize, ??)
>>>
>>> #desired output -- format and order doesn't matter
>>> # (x, y) count
>>> #--------------------
>>> # (1, 1) 2
>>> # (2, 2) 2
>>> # (3, 3) 1
>>> # (4, 4) 1
>>> # (5, 5) 1
>>> # (3, 4) 1
>>> # (4, 1) 1
>>>
>>>
>>
>> Jorge and Dennis gave good responses that get you to the result you asked
>> for, but for completeness I thought I'd include some ddply versions:
>>
>> ddply(df, .(x, y), summarize, freq=length(x))
>>
>> This uses the summarize function you were asking about, however you can
>> also do it with:
>>
>> ddply(df, .(x, y), nrow)
>>
>> or
>>
>> ddply(df, .(x, y), as.data.frame(nrow))
>>
>> The latter giving a slightly nicer name (value instead of V1).
>>
>> As an aside, I prefer using the "summarise" spelling of the function when I
>> do use it, because it won't clash with Hmisc::summarize.
>>
>> ddply(df, .(x, y), summarise, freq=length(x))
>>
>>
>> --
>> Brian S. Diggs, PhD
>> Senior Research Associate, Department of Surgery
>> Oregon Health & Science University
>>
>
--
Brian S. Diggs, PhD
Senior Research Associate, Department of Surgery
Oregon Health & Science University