[R-sig-eco] NA as a result of using GLM
Philip Dixon
pdixon at iastate.edu
Tue Jun 16 17:55:56 CEST 2009
Paul,
I'm sending this to the list, because I am seeing this sort of analysis
proposed (or in manuscripts) quite frequently now. I hope this message will
raise awareness of the need for large sample sizes in logistic regression,
especially if there is an element of model selection.
I completely agree with Gavin that you need to rethink your analysis. Your
reply suggests that you have little biological information to select important
genes. You are trying to use the data to construct an appropriate model.
If that's the case, I suggest there is no appropriate analysis. You need to
collect/measure/study a lot more samples before you can construct a
reasonable model. The problem is that a model developed from a small data
set fits your specific data set too well but does not generalize to new
observations.
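The overfitting problem can be illustrated with a small simulation. This is only a sketch under stated assumptions, not an analysis of the actual data: the 16 "genes" are pure noise unrelated to disease status, the model is a simple threshold rule on a single selected gene rather than a logistic regression, and all function and variable names are hypothetical.

```python
import random

def best_threshold_rule(xs, ys):
    """Return (accuracy, threshold, high_means_diseased) for the best in-sample rule."""
    best = None
    for t in xs:
        for high_means_diseased in (True, False):
            correct = sum(((x >= t) == high_means_diseased) == (y == 1)
                          for x, y in zip(xs, ys))
            acc = correct / len(ys)
            if best is None or acc > best[0]:
                best = (acc, t, high_means_diseased)
    return best

def one_replicate(rng, n_train=10, n_genes=16, n_test=200, p_diseased=0.6):
    # Training data: disease status plus pure-noise "gene" values (assumption).
    y_train = [1 if rng.random() < p_diseased else 0 for _ in range(n_train)]
    genes = [[rng.gauss(0, 1) for _ in range(n_train)] for _ in range(n_genes)]
    # Model selection: keep the gene whose rule fits the training data best.
    train_acc, t, high = max((best_threshold_rule(g, y_train) for g in genes),
                             key=lambda rule: rule[0])
    # New observations: the selected gene is still unrelated to disease status.
    y_test = [1 if rng.random() < p_diseased else 0 for _ in range(n_test)]
    x_test = [rng.gauss(0, 1) for _ in range(n_test)]
    test_acc = sum(((x >= t) == high) == (y == 1)
                   for x, y in zip(x_test, y_test)) / n_test
    return train_acc, test_acc

rng = random.Random(2009)  # fixed seed so the run is repeatable
results = [one_replicate(rng) for _ in range(200)]
mean_train = sum(r[0] for r in results) / len(results)
mean_test = sum(r[1] for r in results) / len(results)
print(f"in-sample accuracy:  {mean_train:.2f}")
print(f"new-sample accuracy: {mean_test:.2f}")
```

Because the genes carry no signal by construction, any gap between the two printed accuracies comes entirely from selecting a model on 10 samples.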
Gerald van Belle has a rough sample size guideline: 10 events per variable
investigated (Statistical Rules of Thumb, p. 87). An event is the rarer of
the two outcomes (control or diseased). Applying that rule suggests you need
160 controls for 16 genes. If diseased and control samples occur in the same
6:4 ratio, that means a total of 400 samples (240 diseased, 160 controls).
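The arithmetic behind those numbers can be written out as a short sketch. The function name and its parameters are hypothetical; the defaults simply encode the 10-events-per-variable guideline and the 6:4 ratio mentioned above, with controls as the rarer group.

```python
def samples_needed(n_variables, events_per_variable=10,
                   rarer_parts=4, other_parts=6):
    """Return (events needed in the rarer group, total samples needed)."""
    events = n_variables * events_per_variable
    total_parts = rarer_parts + other_parts
    # Integer ceiling division avoids floating-point rounding surprises.
    total = (events * total_parts + rarer_parts - 1) // rarer_parts
    return events, total

events, total = samples_needed(16)
print(f"rarer-group samples needed: {events}")
print(f"total samples needed:       {total}")
```

Calling `samples_needed(1)` gives the corresponding figures for a single pre-specified gene under the same guideline.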
Your data set of 10 samples would be too small even if you had identified
one specific gene of interest a priori. You could fit the logistic
regression, but the regression coefficients (slope and intercept) would not
be very precise.
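A rough feel for that imprecision comes from the simplest possible case, an intercept-only logistic model, where the estimated log-odds of disease has the standard error sqrt(1/a + 1/b) for a diseased and b control samples. The sketch below assumes the 10 samples split 6 diseased to 4 controls (inferred from the 6:4 ratio, not stated directly) and uses a hypothetical helper name. It does not compute the standard error of a gene's slope, which depends on the gene's values; it only shows how little information 10 binary outcomes carry.

```python
import math

def log_odds_summary(n_diseased, n_control, z=1.96):
    """Log-odds of disease, its standard error, and an approximate 95% Wald interval."""
    log_odds = math.log(n_diseased / n_control)
    se = math.sqrt(1 / n_diseased + 1 / n_control)
    return log_odds, se, (log_odds - z * se, log_odds + z * se)

# Assumed split of the 10 samples, and the 400-sample design with the same ratio.
for n_d, n_c in [(6, 4), (240, 160)]:
    log_odds, se, (lo, hi) = log_odds_summary(n_d, n_c)
    print(f"n={n_d + n_c:3d}  log-odds={log_odds:.2f}  SE={se:.2f}  "
          f"odds interval=({math.exp(lo):.2f}, {math.exp(hi):.2f})")
```

Since the log-odds is identical in both cases, comparing the two intervals isolates the effect of sample size alone.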
Philip Dixon