[R] min frequencies of categorical predictor variables in GLM
Marc Schwartz
marc_schwartz at me.com
Mon Aug 3 16:46:20 CEST 2009
On Aug 3, 2009, at 12:06 AM, Thomas Mang wrote:
> Hi,
>
> Suppose a binomial GLM with both continuous as well as categorical
> predictors (sometimes referred to as GLM-ANCOVA, if I remember
> correctly). For the categorical predictors = indicator variables, is
> then there a suggested minimum frequency of each level ? Would such
> a rule/ recommendation be dependent on the y-side too ?
>
> Example: N is quite large, a bit > 100. Observed however are only
> 0/1s (so Bernoulli random variables, not Binomial, because the
> covariates are from observations and in general always different
> between observations). There are two categorical predictors, each
> with 2 levels. It would structurally probably also make sense to
> allow an interaction between those, yielding de facto a single
> categorical predictor with 4 levels. Is then there a minimum of
> observations falling in each of the 4 level category (either
> absolute or relative), or also that plus also considering the y-side ?
Must be the day for sample size questions for logistic regression. A
similar query is on MedStats today.
The typical minimum sample size recommendation for logistic regression
is based upon covariate degrees of freedom (or columns in the model
matrix). The guidance is that there should be 10 to 20 *events* per
covariate degree of freedom.
So if you have 2 factors, each with two levels, that gives you two
covariate degrees of freedom total (two columns in the model matrix).
At the high end of the above range, you would need 40 events in your
sample.
If the event incidence in your sample is 10%, you would need 400 cases
to observe 40 events to support the model with the two two-level
covariates (Y ~ X1 + X2).
An interaction term (in addition to the 2 main effect terms, Y ~ X1 *
X2) in this case would add another column to the model matrix, so
you would need an additional 20 events, or another 200 cases in your
sample.
So you could include the two two-level factors and the interaction
term if you have 60 events, or in my example, about 600 cases.
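The arithmetic above can be sketched in a few lines of R. This is only
an illustration of the rule of thumb: required_sample() is a
hypothetical helper written for this post, not a function from any
package, and the defaults (20 events per degree of freedom, 10%
incidence) are simply the values used in my example.

```r
# Hypothetical helper: events-per-degree-of-freedom rule of thumb.
# covariate_df  = number of model matrix columns, excluding the intercept
# events_per_df = 10 to 20 per the usual guidance (20 = high end)
# incidence     = assumed proportion of events in the sample
required_sample <- function(covariate_df, events_per_df = 20,
                            incidence = 0.10) {
  events <- covariate_df * events_per_df  # events needed
  cases  <- events / incidence            # approximate total cases needed
  c(events = events, cases = cases)
}

required_sample(2)  # Y ~ X1 + X2: 40 events, about 400 cases
required_sample(3)  # Y ~ X1 * X2: 60 events, about 600 cases
```

Changing events_per_df to 10 gives the low end of the range.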
If you include the interaction term only in the absence of the main
effects (Y ~ X1:X2), that would yield 4 columns in the model matrix,
requiring 80 events, or about 800 cases. Without more details (e.g.,
your underlying hypothesis), it is not clear to me that you gain
anything here compared with using the main effects and, potentially,
the interaction term together, and you certainly lose in terms of
model interpretation while requiring a notably larger sample size.
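If you want to check the column counts for the three formulas yourself,
something along these lines should work. It is a minimal sketch on a
toy data frame: the level names and the n_cols() helper are made up
for illustration, and the intercept column is excluded from the count
to match the way I counted above.

```r
# Toy design containing every combination of two two-level factors
# (level names are arbitrary).
d <- expand.grid(X1 = factor(c("a", "b")), X2 = factor(c("c", "d")))

# Hypothetical helper: number of model matrix columns, excluding the
# intercept.
n_cols <- function(formula, data) {
  mm <- model.matrix(formula, data)
  sum(colnames(mm) != "(Intercept)")
}

n_cols(~ X1 + X2, d)  # main effects only
n_cols(~ X1 * X2, d)  # main effects plus interaction
n_cols(~ X1:X2, d)    # interaction without main effects
```

One caveat: with the intercept included, the matrix for ~ X1:X2 has
five columns describing only four cells, so I would expect glm() to
report one of those coefficients as NA.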
As for a minimum sample size for each level of the factor-based
covariates, I am not aware of any specific guidance, short of
dealing with empty cells at the extreme. However, there are methods
to assess covariate complexity and to decide whether to collapse
factor levels. For more details on these issues, I would
refer you to Frank Harrell's book, Regression Modeling Strategies,
specifically to chapters 4 and 10-12. The former focuses on general
multivariable strategies and the latter on logistic regression. More
information here:
http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/RmS
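As a first look at the empty cell issue, a cross-tabulation of the two
factors, and of the events within them, is usually enough. The sketch
below uses simulated data purely for illustration; the variable names,
the sample size of 120 and the 10% event rate are all assumptions, not
your data.

```r
# Simulated data for illustration only.
set.seed(1)
d <- data.frame(
  X1 = factor(sample(c("a", "b"), 120, replace = TRUE)),
  X2 = factor(sample(c("c", "d"), 120, replace = TRUE)),
  Y  = rbinom(120, 1, 0.1)
)

# Observations in each combination of factor levels.
cell_n <- table(d$X1, d$X2)

# Events (Y == 1) in each combination.
cell_events <- xtabs(Y ~ X1 + X2, data = d)

cell_n
cell_events

# TRUE if any combination of levels has no observations at all.
any(cell_n == 0)
```

Cells with very few observations or no events are the ones where
collapsing levels, or dropping the interaction, may be worth
considering.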
HTH,
Marc Schwartz