[R] subset a data frame by largest frequencies of factors
S Ellison
S.Ellison at LGCGroup.com
Fri Mar 6 11:15:32 CET 2015
> -----Original Message-----
> A consulting client has a large data set with a binary response
> (negative) and two factors (ctry and member) which have many levels, but
> many occur with very small frequencies. It is far too sparse with a model like
> glm(negative ~ ctry+member, family=binomial).
>
> For analysis, we'd like to subset the data to include only those that occur with
> frequency greater than a given value
ave() helps with this kind of thing.
Something like
freq <- ave(1:length(ctry), factor(ctry:member), FUN=length)
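gives the count for each ctry:member cell (combination of levels), repeated on every row that belongs to that cell. This assumes ctry and member are both factors, since ":" on two factors forms their interaction. Then you can subset a data frame using, for example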
dfr.subset <- dfr[freq>10, ]
The 1:length(ctry) in the ave call is there simply because ave wants a numeric vector as its first argument. If all we're doing with it is counting, it just has to be a numeric of the same length as your data; in a data frame it can be 1:nrow(dfr), etc.
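As a rough end-to-end sketch of the above: the toy data below are invented for illustration, the threshold name min.n is hypothetical, and the droplevels() call is an extra step not mentioned above, which simply discards factor levels that no longer occur after subsetting.

```r
# Toy data; the column names follow the thread, the values are made up.
dfr <- data.frame(
  ctry     = factor(c("A", "A", "A", "B", "B", "C")),
  member   = factor(c("x", "x", "x", "y", "y", "x")),
  negative = c(0, 1, 0, 1, 1, 0)
)

# Count the rows in each ctry:member cell, giving one value per row.
# seq_len(nrow(dfr)) is only a numeric placeholder of the right length.
freq <- with(dfr, ave(seq_len(nrow(dfr)), factor(ctry:member), FUN = length))

# Keep only rows whose cell occurs more than min.n times
# (min.n is a hypothetical name for the cutoff).
min.n <- 2
dfr.subset <- droplevels(dfr[freq > min.n, ])
```

Here freq holds the cell size for each row, so only the three rows in the A:x cell pass the cutoff.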
S Ellison