[R] min frequencies of categorical predictor variables in GLM

Wed Aug 5 15:21:32 CEST 2009

On Aug 5, 2009, at 12:51 AM, Thomas Mang wrote:

> Marc Schwartz wrote:
>> On Aug 3, 2009, at 12:06 AM, Thomas Mang wrote:
>>> Hi,
>>>
>>> Suppose a binomial GLM with both continuous as well as categorical  
>>> predictors (sometimes referred to as GLM-ANCOVA, if I remember  
>>> correctly). For the categorical predictors (indicator variables),
>>> is there a suggested minimum frequency for each level? Would
>>> such a rule or recommendation also depend on the y-side?
>>>
>>> Example: N is quite large, a bit > 100. Observed however are only  
>>> 0/1s (so Bernoulli random variables, not Binomial, because the  
>>> covariates are from observations and in general always different  
>>> between observations). There are two categorical predictors, each  
>>> with 2 levels. It would structurally probably also make sense to  
>>> allow an interaction between those, yielding de facto a single  
>>> categorical predictor with 4 levels. Is there then a minimum number
>>> of observations that should fall in each of the 4 categories (either
>>> absolute or relative), or does the minimum also depend on the
>>> y-side?
>> Must be the day for sample size questions for logistic regression.  
>> A similar query is on MedStats today.
>> The typical minimum sample size recommendation for logistic  
>> regression is based upon covariate degrees of freedom (or columns  
>> in the model matrix). The guidance is that there should be 10 to 20  
>> *events* per covariate degree of freedom.
>> So if you have 2 factors, each with two levels, that gives you two  
>> covariate degrees of freedom total (two columns in the model  
>> matrix). At the high end of the above range, you would need 40  
>> events in your sample.
>> If the event incidence in your sample is 10%, you would need 400
>> cases to observe 40 events to support the model with the two
>> two-level covariates (Y ~ X1 + X2).
>> An interaction term (in addition to the 2 main effect terms, Y ~ X1  
>> * X2) in this case would add another column to the model matrix,  
>> thus, you would need an additional 20 events, or another 200 cases  
>> in your sample.
>> So you could include the two two-level factors and the interaction  
>> term if you have 60 events, or in my example, about 600 cases.
>
> Thanks for that. I suppose your term 'event' does not refer to a
> technical thing of GLMs, so I assume that the number of observed 0s
> _and_ the number of observed 1s each have to be >= 10 / 20 for each
> df (since it's arbitrary which of them is the event, and which is
> the non-event).

Sorry for any confusion. In my applications (clinical), we are
typically modeling/predicting the probability of a discrete event (e.g.
death, stroke, repeat intervention) or, more generally, the
presence/absence of some characteristic (e.g. renal failure). So I
think in terms of events. That also carries over to Cox regression,
where similar 'event'/sample size guidelines are in place for
time-to-event models.

As you note, the count/sample size requirements refer to the less
frequent of the two possible response variable values. So you may be
interested in modeling/predicting a response value that has a
probability of 0.7, but the requirements will be based upon the
response value with probability 0.3.
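
As a rough sketch of that arithmetic (the function name
required_sample and its arguments are made up for illustration, and
the 10-20 figure is only a rule of thumb):

```r
# Hypothetical helper: events and total cases suggested by the
# "10 to 20 events per covariate degree of freedom" rule of thumb.
required_sample <- function(n_df, events_per_df = 20, p_outcome = 0.10) {
  # The requirement is driven by the less frequent response value.
  p_limiting <- min(p_outcome, 1 - p_outcome)
  events <- n_df * events_per_df
  # The small offset guards against floating-point noise before rounding up.
  cases <- ceiling(events / p_limiting - 1e-9)
  c(events = events, cases = cases)
}

required_sample(2)                   # Y ~ X1 + X2: 40 events, 400 cases
required_sample(3)                   # Y ~ X1 * X2: 60 events, 600 cases
required_sample(3, p_outcome = 0.7)  # limiting proportion is 0.3: 200 cases
```

The first two calls reproduce the 400 and 600 cases from the earlier
example with a 10% incidence.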

>
> OK, two questions: The model also contains continuous predictors
> (call them W), so the model is Y ~ X1*X2 + W. Does the same apply
> here too -> for each df of these, 10-20 more events? [If the answer
> to the former is yes, this question is now redundant:] If there are
> interactions between the continuous covariates and a categorical
> predictor (Y ~ X1 * (X2 + W)), how many more events do I need? Does
> the rule for the categorical predictors count, or that for the
> continuous covariates?

I tend to think in terms of the number of columns that would be in the  
model matrix, where each column corresponds to one covariate degree of  
freedom. So if you create a model matrix using contrived data that  
reflects your expected actual data, along with a given formula, you  
can perhaps better quantify the requirements. See ?model.matrix for  
more information.
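
A minimal sketch of that approach, assuming contrived data with two
two-level factors X1 and X2 and one continuous W (the data values and
the helper name covariate_df are invented for illustration):

```r
# Contrived data: every combination of the two factors, repeated 5 times.
dummy <- expand.grid(X1 = factor(c("a", "b")),
                     X2 = factor(c("c", "d")),
                     rep = 1:5)
dummy$W <- seq(0, 1, length.out = nrow(dummy))  # arbitrary continuous values

# Hypothetical helper: number of model matrix columns, not counting
# the intercept.
covariate_df <- function(rhs, data) {
  mm <- model.matrix(rhs, data = data)
  sum(colnames(mm) != "(Intercept)")
}

covariate_df(~ X1 + X2, dummy)        # 2
covariate_df(~ X1 * X2, dummy)        # 3
covariate_df(~ X1 * X2 + W, dummy)    # 4
covariate_df(~ X1 * (X2 + W), dummy)  # 5
```

A one-sided formula is enough here, since only the right-hand side
determines the columns of the model matrix.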

Each continuous variable entered as a main effect term creates a
single column in the model matrix and therefore adds one degree of
freedom, requiring 10-20 more 'events' and the corresponding increase
in the total number of cases.

A single interaction term between a factor and a continuous variable
(Factor * Continuous) results in 'nlevels(factor) - 1' columns in the
model matrix in addition to those for the two main effects. So again,
the 'event'/sample size requirements apply to each additional column.
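
To illustrate with a factor that has more than two levels, here is a
small sketch using made-up variables grp (3 levels) and dose, with the
default treatment contrasts assumed:

```r
# Contrived data: a 3-level factor and a continuous variable.
grp  <- factor(rep(c("low", "mid", "high"), each = 4),
               levels = c("low", "mid", "high"))
dose <- rep(c(1, 2, 3, 4), times = 3)
dummy <- data.frame(grp, dose)

main_only <- model.matrix(~ grp + dose, data = dummy)
with_int  <- model.matrix(~ grp * dose, data = dummy)

# Columns contributed by the interaction alone.
extra <- setdiff(colnames(with_int), colnames(main_only))
extra                                    # "grpmid:dose" "grphigh:dose"
length(extra) == nlevels(dummy$grp) - 1  # TRUE
```

With 10-20 events per column, this interaction alone would call for
another 20-40 events.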

Of course, more complex interaction terms and formulae will impact the  
model matrix accordingly, so as noted, it may be best to create one  
using dummy data, if your model formulae will be more complicated.
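
For a quick check of a given data set against the guideline,
something along these lines may help. It is only a sketch: the helper
name events_per_column and the data below are invented, and a 0/1
coded response is assumed.

```r
# Hypothetical helper: count of the less frequent response value
# divided by the number of non-intercept model matrix columns.
events_per_column <- function(formula, data) {
  y  <- model.response(model.frame(formula, data = data))
  mm <- model.matrix(formula, data = data)
  n_cols   <- sum(colnames(mm) != "(Intercept)")
  limiting <- min(sum(y == 1), sum(y == 0))
  limiting / n_cols
}

# Contrived data: 120 cases, 36 of them with y = 1.
dat <- data.frame(
  X1 = factor(rep(c("a", "b"), times = 60)),
  X2 = factor(rep(c("c", "d"), each = 60)),
  W  = seq(-1, 1, length.out = 120),
  y  = rep(c(1, 0), times = c(36, 84))
)

events_per_column(y ~ X1 * (X2 + W), dat)  # 36 / 5 = 7.2
events_per_column(y ~ X1 + X2, dat)        # 36 / 2 = 18
```

In this made-up case the larger formula falls below 10 events per
column, while the two main effects alone fall within the 10-20 range.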

HTH,

Marc Schwartz