[R-sig-eco] NA as a result of using GLM

Tue Jun 16 17:55:56 CEST 2009

Paul,

I'm sending this to the list, because I am seeing this sort of analysis 
proposed (or in manuscripts) quite frequently now.  I hope this message will 
raise awareness of the need for large sample sizes in logistic regression, 
especially if there is an element of model selection.

I completely agree with Gavin that you need to rethink your analysis.  Your 
reply suggests that you have little biological information with which to 
select important genes.  You are trying to use the data to construct an 
appropriate model.

If that's the case, I suggest there is no appropriate analysis.  You need to 
collect/measure/study a lot more samples before you can construct a model 
that is reasonable.  The problem is that a model developed from a small data 
set fits your specific data set too well but does not generalize to new 
observations.

Gerald van Belle has a rough sample size guideline: 10 events per variable 
investigated (Statistical Rules of Thumb, p. 87).  An event is the rarer of 
control or diseased.  Applying that rule suggests you need 160 controls for 16 
genes.  If diseased and controls occur in the same 6:4 ratio, that means 240 
diseased and a total of 400 samples.
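
For anyone who wants to redo that arithmetic for a different number of genes 
or a different case:control ratio, here is a minimal sketch in Python.  The 
function name and its arguments are my own illustration, not anything from 
van Belle's book, and it assumes the class ratio is given as whole-number 
parts (4 and 6 for the example above).

```python
def required_samples(n_variables, rarer_parts, commoner_parts,
                     events_per_variable=10):
    """Rough sample size under an events-per-variable rule of thumb.

    An "event" is an observation in the rarer class.  The ratio of the
    two classes is given as integer parts, e.g. 4 and 6 for a 6:4 split.
    Returns (events needed in the rarer class, total samples needed).
    """
    if rarer_parts <= 0 or commoner_parts < rarer_parts:
        raise ValueError("need 0 < rarer_parts <= commoner_parts")
    events = events_per_variable * n_variables
    all_parts = rarer_parts + commoner_parts
    # Integer ceiling of events * all_parts / rarer_parts.
    total = (events * all_parts + rarer_parts - 1) // rarer_parts
    return events, total


# 16 genes, controls are the rarer class in a 6:4 split.
events, total = required_samples(16, rarer_parts=4, commoner_parts=6)
print(events, total)  # 160 400
```

The same call with n_variables=1 gives 10 events and 25 samples in total, 
which is already more than the 10 samples available.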

Your data set of 10 samples would be too small even if you had identified 
one specific gene of interest a priori.  You could do the logistic 
regression, but the regression coefficients (slope and intercept) will not be 
very precise.

Philip Dixon