# [R] min frequencies of categorical predictor variables in GLM

Marc Schwartz marc_schwartz at me.com
Wed Aug 5 15:21:32 CEST 2009

```
On Aug 5, 2009, at 12:51 AM, Thomas Mang wrote:

> Marc Schwartz wrote:
>> On Aug 3, 2009, at 12:06 AM, Thomas Mang wrote:
>>> Hi,
>>>
>>> Suppose a binomial GLM with both continuous as well as categorical
>>> predictors (sometimes referred to as GLM-ANCOVA, if I remember
>>> correctly). For the categorical predictors = indicator variables,
>>> is then there a suggested minimum frequency of each level ? Would
>>> such a rule/ recommendation be dependent on the y-side too ?
>>>
>>> Example: N is quite large, a bit > 100. Observed however are only
>>> 0/1s (so Bernoulli random variables, not Binomial, because the
>>> covariates are from observations and in general always different
>>> between observations). There are two categorical predictors, each
>>> with 2 levels. It would structurally probably also make sense to
>>> allow an interaction between those, yielding de facto a single
>>> categorical predictor with 4 levels. Is then there a minimum of
>>> observations falling in each of the 4 level category (either
>>> absolute or relative), or also that plus also considering the y-
>>> side ?
>> Must be the day for sample size questions for logistic regression.
>> A similar query is on MedStats today.
>> The typical minimum sample size recommendation for logistic
>> regression is based upon covariate degrees of freedom (or columns
>> in the model matrix). The guidance is that there should be 10 to 20
>> *events* per covariate degree of freedom.
>> So if you have 2 factors, each with two levels, that gives you two
>> covariate degrees of freedom total (two columns in the model
>> matrix). At the high end of the above range, you would need 40
>> events. If the event incidence in your sample is 10%, you would
>> need 400 cases to observe 40 events to support the model with the
>> two two-level covariates (Y ~ X1 + X2).
>> An interaction term (in addition to the 2 main effect terms, Y ~ X1
>> * X2) in this case would add another column to the model matrix,
>> thus, you would need an additional 20 events, or another 200 cases.
>> So you could include the two two-level factors and the interaction
>> term if you have 60 events, or in my example, about 600 cases.
>
> Thanks for that. I suppose your term 'event' does not refer to a
> technical thing of GLMs, so I assume that both the number of
> observed 0s _or_ 1s have to be >= 10 / 20 for each df (since it's
> arbitrary what of them is the event, and what is the non-event).

Sorry for any confusion. In my applications (clinical), we are
typically modeling/predicting the probability of a discrete event
(e.g. death, stroke, repeat intervention) or, more generally, the
presence/absence of some characteristic (e.g. renal failure). So I
tend to think in terms of events, which also corresponds to Cox
regression, where similar 'event'/sample size guidelines are in place
for time-based event models.

As you note, the count/sample size requirements refer to the smaller
of the two response value proportions. So you may be interested in
modeling/predicting a response value that has a probability of 0.7,
but the requirements will be based upon the 0.3 probability response
value.
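# As a concrete sketch of that arithmetic in R (the 10% incidence and
# the 20-events-per-df figure are just the numbers from the example
# above, not fixed constants; 'incidence' is the proportion of the
# rarer of the two response values):

events_per_df <- 20              # high end of the 10-20 guideline
model_df      <- 2               # two two-level factors -> two df
incidence     <- 0.10            # proportion of the rarer response value

events_needed <- events_per_df * model_df     # 40 events
cases_needed  <- events_needed / incidence    # 400 cases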

>
> OK, two questions: The model also contains continuous predictors
> (call them W), so the model is Y ~ X1*X2 + W. Does the same apply
> here too -> for each df of these, 10-20 more events? [If the answer
> to the former yes, this question is now redundant:] If there are
> interactions between the continuous covariates and a categorical
> predictor (Y ~ X1 * (X2 + W)), how many more events do I need? Does
> the rule for the categorical predictors count, or that for the
> continuous covariates ?

I tend to think in terms of the number of columns that would be in the
model matrix, where each column corresponds to one covariate degree of
freedom. So if you create a model matrix using contrived data that
reflects your expected actual data, along with a given formula, you
can perhaps better quantify the requirements. See ?model.matrix for
details.

Each continuous variable, as a main effect term, creates a single
column in the model matrix and therefore adds one degree of freedom,
requiring 10-20 'events' each, with the corresponding increase in the
total number of cases.
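# For instance, you can count the covariate degrees of freedom for
# Y ~ X1*X2 + W by building the model matrix from a few rows of
# contrived data (made up purely for illustration) and counting the
# non-intercept columns:

d <- data.frame(X1 = factor(c("a", "a", "b", "b")),
                X2 = factor(c("c", "d", "c", "d")),
                W  = c(1.2, 0.7, 3.1, 2.4))
mm <- model.matrix(~ X1 * X2 + W, data = d)
colnames(mm)   # "(Intercept)" "X1b" "X2d" "W" "X1b:X2d"
ncol(mm) - 1   # 4 covariate df -> 40-80 'events' by the guideline above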

A single interaction term between a factor and a continuous variable
(Factor * Continuous) results in 'nlevels(factor) - 1' additional
columns in the model matrix. So again, for each additional column, the
'event'/sample size requirements are in place.
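# For example, with a contrived three-level factor f and a continuous
# W, the formula ~ f * W adds nlevels(f) - 1 = 2 interaction columns
# on top of the main effect columns:

d2 <- data.frame(f = factor(rep(c("a", "b", "c"), 2)),
                 W = c(0.5, 1.5, 2.5, 3.5, 4.5, 5.5))
mm2 <- model.matrix(~ f * W, data = d2)
colnames(mm2)   # "(Intercept)" "fb" "fc" "W" "fb:W" "fc:W"
ncol(mm2) - 1   # 5 covariate df in total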

Of course, more complex interaction terms and formulae will impact the
model matrix accordingly, so as noted, it may be best to create one
using dummy data, if your model formulae will be more complicated.

HTH,

Marc Schwartz

```