# [R] min frequencies of categorical predictor variables in GLM

Thomas Mang thomas.mang at fiwi.at
Wed Aug 5 07:51:21 CEST 2009

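For readers who want to check the arithmetic in the reply quoted below: the rule of thumb there is 10 to 20 events per covariate degree of freedom (column in the model matrix). A minimal Python sketch of that calculation follows. The function names `required_events` and `required_cases` are hypothetical, and treating the rarer of the two outcomes as the limiting "event" is an assumption, not something the thread settles.

```python
import math
from fractions import Fraction


def required_events(df, events_per_df=20):
    """Events needed for `df` covariate degrees of freedom (rule of thumb)."""
    return df * events_per_df


def required_cases(df, incidence, events_per_df=20):
    """Total sample size needed to expect enough events.

    `incidence` is the proportion of 1s in the sample. Assumption: the
    rarer outcome (0s or 1s) is the one that limits the model.
    """
    # Exact rational arithmetic avoids floating-point surprises in ceil().
    p = Fraction(str(incidence))
    limiting = min(p, 1 - p)
    return math.ceil(required_events(df, events_per_df) / limiting)


# Y ~ X1 + X2: two two-level factors, 2 df
print(required_cases(2, 0.1))  # 400
# Y ~ X1 * X2: main effects plus interaction, 3 df
print(required_cases(3, 0.1))  # 600
```

With a 10% incidence and 20 events per degree of freedom, this reproduces the figures in the thread: 400 cases for Y ~ X1 + X2 and 600 cases for Y ~ X1 * X2.
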
```
Marc Schwartz wrote:
> On Aug 3, 2009, at 12:06 AM, Thomas Mang wrote:
>
>> Hi,
>>
>> Suppose a binomial GLM with both continuous as well as categorical
>> predictors (sometimes referred to as GLM-ANCOVA, if I remember
>> correctly). For the categorical predictors (i.e. indicator variables), is
>> there a suggested minimum frequency for each level? Would such a
>> rule/recommendation depend on the y-side too?
>>
>> Example: N is quite large, a bit > 100. Observed however are only 0/1s
>> (so Bernoulli random variables, not Binomial, because the covariates
>> are from observations and in general always different between
>> observations). There are two categorical predictors, each with 2
>> levels. It would structurally probably also make sense to allow an
>> interaction between those, yielding de facto a single categorical
>> predictor with 4 levels. Is there then a minimum number of observations
>> that must fall into each of the 4 level categories (either absolute or
>> relative), or does the minimum also depend on the y-side?
>
> Must be the day for sample size questions for logistic regression. A
> similar query is on MedStats today.
>
> The typical minimum sample size recommendation for logistic regression
> is based upon covariate degrees of freedom (or columns in the model
> matrix). The guidance is that there should be 10 to 20 *events* per
> covariate degree of freedom.
>
> So if you have 2 factors, each with two levels, that gives you two
> covariate degrees of freedom total (two columns in the model matrix). At
> the high end of the above range, you would need 40 events in your sample.
>
> If the event incidence in your sample is 10%, you would need 400 cases
> to observe 40 events to support the model with the two two-level
> covariates (Y ~ X1 + X2).
>
> An interaction term (in addition to the 2 main effect terms, Y ~ X1 *
> X2) in this case would add another column to the model matrix, thus, you
> would need an additional 20 events, or another 200 cases in your sample.
>
> So you could include the two two-level factors and the interaction term
> if you have 60 events, or in my example, about 600 cases.

Thanks for that. I suppose your term 'event' is not a technical GLM
term, so I assume that both the number of observed 0s and the number of
observed 1s have to be >= 10 to 20 per df (since it's arbitrary which of
them is the event and which is the non-event).

OK, two questions: The model also contains continuous predictors (call
them W), so the model is Y ~ X1*X2 + W. Does the same apply here too, i.e.
10-20 more events for each df of these? [If the answer to the former is
yes, this question is now redundant:] If there are interactions between
the continuous covariates and a categorical predictor (Y ~ X1 * (X2 +
W)), how many more events do I need? Does the rule for the categorical
predictors apply, or the one for the continuous covariates?

Many thanks!
Thomas

>
> If you include the interaction term only in the absence of the main
> effects (Y ~ X1:X2), that would yield 4 columns in the model matrix,
> requiring 80 events, or about 800 cases. Without more details (e.g. your
> underlying hypothesis), it is not clear to me that you gain anything
> here as compared to the use of the main effects and potentially, the
> interaction term together, and you certainly lose in terms of model
> interpretation and requiring a notably larger sample size.
>
> Relative to a minimum sample size for each of the levels in the
> factor-based covariates, I am not aware of any specific guidance there,
> short of dealing with empty cells at the extreme. However, there are methods
> to assess covariate complexity and the consideration for the collapsing
> of factor levels. For more details on these issues, I would refer you to
> Frank Harrell's book, Regression Modeling Strategies, specifically to chapters 4
> and 10-12. The former focuses on general multivariable strategies and