[R] min frequencies of categorical predictor variables in GLM

Wed Aug 5 07:51:21 CEST 2009

Marc Schwartz wrote:
> On Aug 3, 2009, at 12:06 AM, Thomas Mang wrote:
> 
>> Hi,
>>
>> Suppose a binomial GLM with both continuous and categorical 
>> predictors (sometimes referred to as GLM-ANCOVA, if I remember 
>> correctly). For the categorical predictors (indicator variables), is 
>> there a suggested minimum frequency for each level? Would such a 
>> rule or recommendation depend on the y-side too?
>>
>> Example: N is quite large, a bit > 100. Observed however are only 0/1s 
>> (so Bernoulli random variables, not Binomial, because the covariates 
>> are from observations and in general always different between 
>> observations). There are two categorical predictors, each with 2 
>> levels. It would structurally probably also make sense to allow an 
>> interaction between those, yielding de facto a single categorical 
>> predictor with 4 levels. Is there then a minimum number of observations 
>> that should fall in each of the 4 categories (either absolute or 
>> relative), possibly also taking the y-side into account?
> 
> Must be the day for sample size questions for logistic regression. A 
> similar query is on MedStats today.
> 
> The typical minimum sample size recommendation for logistic regression 
> is based upon covariate degrees of freedom (or columns in the model 
> matrix). The guidance is that there should be 10 to 20 *events* per 
> covariate degree of freedom.
> 
> So if you have 2 factors, each with two levels, that gives you two 
> covariate degrees of freedom total (two columns in the model matrix). At 
> the high end of the above range, you would need 40 events in your sample.
> 
> If the event incidence in your sample is 10%, you would need 400 cases 
> to observe 40 events to support the model with the two two-level 
> covariates (Y ~ X1 + X2).
> 
> An interaction term (in addition to the 2 main effect terms, Y ~ X1 * 
> X2) in this case would add another column to the model matrix, thus, you 
> would need an additional 20 events, or another 200 cases in your sample.
> 
> So you could include the two two-level factors and the interaction term 
> if you have 60 events, or in my example, about 600 cases.
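
The arithmetic behind the rule of thumb quoted above can be sketched in a few lines. This is only an illustration: the function name `required_sample` and its parameters are made up here, and the defaults (20 events per covariate degree of freedom, 10% event incidence) are simply the numbers used in the example above, not fixed constants of the rule.

```python
import math

def required_sample(covariate_df, events_per_df=20, incidence=0.10):
    """Return (events, cases) needed under the events-per-df rule of thumb.

    Hypothetical helper: covariate_df is the number of non-intercept
    columns in the model matrix; incidence is the assumed event rate.
    """
    events = covariate_df * events_per_df
    # Expected number of cases needed to observe that many events.
    # round() guards against floating-point noise before taking the ceiling.
    cases = math.ceil(round(events / incidence, 9))
    return events, cases

# Y ~ X1 + X2: two 2-level factors, 2 covariate df
print(required_sample(2))
# Y ~ X1 * X2: one more column for the interaction, 3 covariate df
print(required_sample(3))
```

Passing `events_per_df=10` gives the low end of the 10 to 20 range instead.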

Thanks for that. I suppose your term 'event' does not refer to a 
technical thing of GLMs, so I assume that the number of observed 0s 
_and_ the number of observed 1s each have to be >= 10 to 20 for each df 
(since it is arbitrary which of them is the event and which the non-event).

OK, two questions: The model also contains continuous predictors (call 
them W), so the model is Y ~ X1*X2 + W. Does the same apply here too, 
i.e. 10-20 more events for each df of these? [If the answer to the former 
is yes, this question is redundant:] If there are interactions between 
the continuous covariates and a categorical predictor (Y ~ X1 * (X2 + 
W)), how many more events do I need? Does the rule for the categorical 
predictors apply, or the one for the continuous covariates?
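
If the rule is read literally as "events per column of the model matrix", a continuous covariate entered linearly contributes one column, and its interaction with a k-level factor contributes k - 1 columns. The sketch below counts columns on that reading only. It assumes treatment contrasts, an intercept, and that all lower-order terms are in the model; `term_df` and the `levels` table are hypothetical names. Whether the 10-20 figure should be applied unchanged to continuous predictors is exactly the open question here.

```python
from math import prod

# Number of levels per variable; 1 marks a continuous covariate
# entered as a single linear term (an assumption of this sketch).
levels = {"X1": 2, "X2": 2, "W": 1}

def term_df(term, levels):
    """Columns contributed by one term such as 'X1' or 'X1:W'."""
    # A k-level factor contributes k - 1 contrast columns,
    # a continuous covariate contributes 1.
    return prod(max(levels[v] - 1, 1) for v in term.split(":"))

# Y ~ X1 * (X2 + W) expands to these terms (intercept excluded).
terms = ["X1", "X2", "W", "X1:X2", "X1:W"]
total_df = sum(term_df(t, levels) for t in terms)

# Covariate df, then events at 10 and at 20 events per df.
print(total_df, 10 * total_df, 20 * total_df)
```

On this literal reading the two kinds of predictor are not treated differently: each column counts once, whatever generated it.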

many thanks !
Thomas

> 
> If you include the interaction term only in the absence of the main 
> effects (Y ~ X1:X2), that would yield 4 columns in the model matrix, 
> requiring 80 events, or about 800 cases. Without more details (e.g., your 
> underlying hypothesis), it is not clear to me that you gain anything 
> here as compared to the use of the main effects and potentially, the 
> interaction term together, and you certainly lose in terms of model 
> interpretation and requiring a notably larger sample size.
> 
> Regarding a minimum sample size for each level of the factor-based 
> covariates, I am not aware of any specific guidance, short of dealing 
> with empty cells at the extreme. However, there are methods to assess 
> covariate complexity and to decide whether factor levels should be 
> collapsed. For more details on these issues, I would refer you to 
> Frank Harrell's book, Regression Modeling Strategies, specifically to 
> chapters 4 and 10-12. The former focuses on general multivariable 
> strategies and the latter on logistic regression. More information here:
> 
>   http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/RmS
> 
> HTH,
> 
> Marc Schwartz
>